ARTICLE IN PRESS
JID: FI
[m1+;September 21, 2019;0:15]
Available online at www.sciencedirect.com
Journal of the Franklin Institute xxx (xxxx) xxx www.elsevier.com/locate/jfranklin
Primal-dual stochastic distributed algorithm for constrained convex optimization

Youcheng Niu a, Haijing Wang a, Zheng Wang a, Dawen Xia b, Huaqing Li a,∗

a Chongqing Key Laboratory of Nonlinear Circuits and Intelligent Information Processing, College of Electronic and Information Engineering, Southwest University, Chongqing 400715, PR China
b College of Data Science and Information Engineering, Guizhou Minzu University, Guiyang 550025, PR China

Received 27 February 2019; received in revised form 6 June 2019; accepted 19 July 2019; available online xxx
Abstract

This paper investigates distributed convex optimization problems over an undirected and connected network, where each node's variable lies in a private constrained convex set and all nodes aim to collectively minimize the sum of the local objective functions. Motivated by a variety of machine learning applications in which large-scale training sets are distributed to multiple autonomous nodes, each local objective function is further designed as the average of a moderate number of local instantaneous functions. Neither the local objective functions nor the constrained sets can be shared with other nodes. A primal-dual stochastic algorithm is presented to address such problems: each node updates its state using an unbiased stochastic averaging gradient and projects onto its private constrained set. At each iteration, each node evaluates the gradient of one randomly selected local instantaneous function and uses the average of the most recent stochastic gradients to approximate its true local gradient. In the constrained case, we show that, under strong convexity of the local instantaneous functions and Lipschitz continuity of their gradients, the algorithm converges to the global optimization solution almost surely. In the unconstrained case, an explicit linear convergence rate of the algorithm is provided. Numerical experiments demonstrate the correctness of the theoretical results. © 2019 The Franklin Institute. Published by Elsevier Ltd. All rights reserved.

Keywords: Constrained convex optimization; Machine learning; Primal-dual algorithm; Stochastic averaging gradients; Linear convergence.
∗ Corresponding author. E-mail address: [email protected] (H. Li).
https://doi.org/10.1016/j.jfranklin.2019.07.018
0016-0032/© 2019 The Franklin Institute. Published by Elsevier Ltd. All rights reserved.
Please cite this article as: Y. Niu, H. Wang and Z. Wang et al., Primal-dual stochastic distributed algorithm for constrained convex optimization, Journal of the Franklin Institute, https://doi.org/10.1016/j.jfranklin.2019.07.018
1. Introduction

Distributed convex optimization problems have recently received much attention and been extensively investigated in control systems [1–3], sensor networks [4–6], machine learning [7–9], source localization [10,11], and wireless systems [12–14], to name a few. In these scenarios, no node or agent acts as a fusion center in the network, and the tasks of computation and data transmission are distributed over all the nodes. In such a distributed manner, the system avoids the risks of network congestion and computation overload. Generally, in distributed convex optimization the autonomous networked nodes cooperatively minimize a global objective function, given by the sum of the nodes' private local objective functions, via communication with their neighbors only. In this paper, we focus on the case where each node's variable is constrained to a closed convex set and each constrained set cannot be shared with other nodes. In addition, the local objective function is set as the average of a number of local instantaneous functions. This setup is motivated by a variety of applications in machine learning with large-scale training sets [36], e.g., logistic regression and reinforcement learning, where, to improve training speed by exploiting parallel computing, the training set is partitioned across multiple computational nodes.

Literature review. There is a large body of work on distributed convex optimization. Early typical work comprises the distributed gradient descent (DGD) method [15], where at each iteration each node averages its local solution with the information of its neighbors and then performs a local gradient-descent update. DGD-based methods are widely applied to distributed convex optimization problems with nodes of limited computation capability, owing to their simplicity of calculation. To realize exact convergence, an appropriate selection of the step-size is required in DGD-based methods.
For instance, DGD-based methods with a diminishing step-size converge to an exact solution at a sublinear rate for arbitrary convex local objective functions [15], i.e., $O(\ln k/\sqrt{k})$. With the help of strong convexity of the local objective functions, the convergence rate can be improved to $O(\ln k/k)$. In fact, it is the diminishing step-size that makes DGD-based methods converge slowly. In particular, with strong convexity of the local objective function and Lipschitz continuity of its gradient, DGD-based methods can be accelerated to a linear convergence rate by utilizing a constant step-size [16], although they then suffer from inexact convergence. Related work on DGD-based methods also includes extensions to directed and time-varying networks [18–20]. For the purpose of achieving exact convergence at a faster rate, a distributed algorithm [17] is proposed on the basis of Nesterov gradient methods; however, at each iteration nodes require excessive consensus computation. Recently, Qu et al. investigate an efficient distributed optimization algorithm that harnesses function smoothness and thereby achieves a linear convergence rate. To achieve fast convergence, some efficient algorithms based on augmented Lagrangian functions have been developed [21–24]. A typical method is the decentralized Alternating Direction Method of Multipliers (ADMM) [21]. Various distributed versions of ADMM have since been proposed [22,23], where a linear convergence rate is proved for strongly convex local objective functions with a constant step-size. However, distributed ADMM suffers from a high computation burden since at each iteration each node is required to solve a local optimization problem. To overcome this issue, the linearized ADMM algorithm [24] is proposed by linearizing the local objective functions. Shi et al. [25] develop an exact first-order
algorithm (EXTRA), which realizes a linear convergence rate by using the difference of two consecutive DGD steps. The abovementioned algorithms [15,17,22–25] only consider the unconstrained case, where each node's variable is free to lie in the Euclidean space.

The primary interest of this paper is the constrained convex optimization problem whose global constraint is the intersection of the nodes' private constrained convex sets. Related literature on the constrained case is vast [26–33]. A primal-dual (sub)gradient method for addressing the constrained optimization problem is proposed by Nedić and Ozdaglar [26], which is essentially a gradient algorithm for finding a saddle point of the augmented Lagrangian function. A distributed constrained problem with a global inequality constraint is considered in [27,28]; when the inequality constraint is complex, it may lead to a high computation burden. Lei et al. [29] investigate a distributed convex optimization problem with locally known constrained convex sets for smooth objective functions. The case in which the constrained convex set is known to all networked nodes is further studied in [30–32]. Particularly, Lee and Nedić [33] take random constrained set selections into account. It should be pointed out that the above algorithms [29–33] deal with the set constraints by projecting onto the constrained sets.

It is worth noting that the algorithms in [15,17,22–25,29] require evaluating true gradients of the local objective functions at each iteration, so their practicality is hindered in high-dimensional and large-scale problems by the high computation cost of local gradients. To keep the computation simple, stochastic gradient descent (SGD) methods have been proposed as an alternative, where at each iteration only the gradient of one randomly selected local instantaneous function is evaluated.
However, a major shortcoming is that the variance of the stochastic gradient estimates converges to zero only if a diminishing step-size is employed, which leads to a sublinear convergence rate. Utilizing variance-reduction techniques, stochastic averaging gradients (SAG) [34] and unbiased stochastic averaging gradients (SAGA) [35] have been proposed; they achieve exact convergence with a constant step-size. Specifically, at each iteration SAGA utilizes the average of the most recent stochastic local gradients as an approximation of the true gradient of the global function. Recently, stochastic gradient methods have been intensively investigated in [36–40].

Contributions. The insightful works [29,36] are most closely related to this paper. A constrained convex optimization problem with private constrained closed convex sets is discussed in [29] using true local gradients, which makes gradient evaluation costly in large-scale problems. To reduce this cost, Mokhtari and Ribeiro [36] employ unbiased stochastic averaging gradients in EXTRA for a machine learning problem in which the local objective function of each node is the average of a number of local instantaneous functions. Unlike [35], in [36] the average of the most recent stochastic local instantaneous function gradients is utilized to approximate the gradient of the local objective function. However, the results in [36] apply only to unconstrained convex optimization problems. In this paper, on the basis of the local unbiased stochastic averaging gradients introduced by Mokhtari and Ribeiro [36], we present a novel primal-dual stochastic distributed algorithm for the constrained convex optimization problem in which each node's variable is subject to a private constrained closed convex set. The proposed algorithm significantly reduces the evaluation cost of the gradients of the local objective functions.
With strong convexity of the local instantaneous functions and Lipschitz continuity of their gradients, we establish convergence to the global optimization solution with a constant step-size.
Besides, we provide a specific upper bound for the selection of the step-size. In addition, when each node's variable is free to lie in the Euclidean space, the presented algorithm is proved to possess a linear convergence rate.

Organization. The remainder of the paper is organized as follows. Section 2 formulates the original constrained optimization problem and provides its reformulation and the necessary assumptions. Section 3 introduces an existing primal-dual distributed algorithm and unbiased stochastic averaging gradients. A primal-dual stochastic distributed algorithm is presented in Section 4. Section 5 includes the supporting lemmas and the main convergence results. Numerical experiments on logistic regression problems are conducted in Section 6, and the conclusion is given in Section 7.

Notation. We write $1_m$ for the m-dimensional vector with all entries equal to one, and $\otimes$ for the Kronecker product. We denote by $\|x\|$ the Euclidean norm of a vector $x$. Given a vector $v$ and a symmetric positive semi-definite matrix $A$, the $A$-weighted norm is $\|v\|_A = \sqrt{v^T A v}$. Define $\lambda_{\max}(A)$ and $\lambda_{\min}(A)$ as the largest and smallest eigenvalues of a matrix $A$, respectively. We use $P_{X_i}[v]$ to denote the projection of the vector $v$ onto a convex set $X_i$. The notation $\mathbb{E}[\cdot]$ represents the expectation with respect to a stochastic process.
2. Problem formulation

Consider a network consisting of m nodes over an undirected and connected graph $G = \{V, E, A\}$, where $V = \{1, \ldots, m\}$ is the collection of nodes, $E \subset V \times V$ is the collection of edges, and $A = [a_{ij}] \in \mathbb{R}^{m \times m}$ is the adjacency matrix satisfying $a_{ij} = a_{ji} > 0$ if $(i, j) \in E$ and $a_{ij} = 0$ if $(i, j) \notin E$. We let $N_i = \{j \mid (i, j) \in E\}$ denote the neighborhood set of node i. Denote the diagonal degree matrix by $D = \mathrm{diag}\{\sum_{j=1}^{m} a_{1j}, \ldots, \sum_{j=1}^{m} a_{mj}\}$. The Laplacian matrix $L$ of graph $G$ is defined as $L = D - A$; it is symmetric and positive semi-definite and satisfies $L 1_m = 0$. We focus on an undirected and connected network where the m nodes cooperatively address the constrained convex optimization problem

$$\min_{\tilde{x} \in \mathbb{R}^n} \ \tilde{f}(\tilde{x}) = \sum_{i=1}^{m} f_i(\tilde{x}) = \sum_{i=1}^{m} \frac{1}{q_i} \sum_{j=1}^{q_i} f_i^j(\tilde{x}), \quad \text{s.t. } \tilde{x} \in \tilde{X} = \bigcap_{i=1}^{m} X_i \tag{1}$$

where $\tilde{x} \in \tilde{X}$ is a global variable, the local objective function $f_i : \mathbb{R}^n \to \mathbb{R}$, $i \in V$, of each node is defined as the average of $q_i$ local instantaneous functions $f_i^j : \mathbb{R}^n \to \mathbb{R}$, $j = 1, \ldots, q_i$, and $X_i \subset \mathbb{R}^n$ is a closed convex set of node i. Each $f_i$ and $X_i$ is possessed only by node i and cannot be shared with other nodes. Denote by $\tilde{x}^*$ a global optimization solution to problem (1). In problem (1), the m nodes try to minimize the global objective function $\tilde{f}(\tilde{x})$ over the constrained set $\tilde{X}$, the intersection of the m private constrained sets $X_i$, while each node can communicate with its neighbors only. Problem (1) applies to machine learning scenarios where, to obtain the optimal classifier $\tilde{x}^*$, a large training set with a total of $\sum_{i=1}^{m} q_i$ training samples is distributed over m nodes for parallel processing [36].
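The graph quantities above can be formed directly from the adjacency matrix. A minimal sketch, assuming a hypothetical 4-node weighted cycle (the topology and weights are illustrative, not data from the paper):

```python
import numpy as np

# Hypothetical 4-node undirected cycle; a_ij = a_ji > 0 on edges, 0 otherwise.
A = np.array([[0.0, 0.5, 0.0, 0.5],
              [0.5, 0.0, 0.5, 0.0],
              [0.0, 0.5, 0.0, 0.5],
              [0.5, 0.0, 0.5, 0.0]])

D = np.diag(A.sum(axis=1))   # degree matrix D = diag(sum_j a_ij)
L = D - A                    # graph Laplacian L = D - A

# L is symmetric, positive semi-definite, and annihilates the all-ones vector.
assert np.allclose(L, L.T)
assert np.allclose(L @ np.ones(4), 0.0)
assert np.all(np.linalg.eigvalsh(L) >= -1e-12)
```

The property $L 1_m = 0$ is what later lets the constraint $\mathbf{L}x = 0$ encode consensus among the nodes.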
2.1. Problem reformulation

Lemma 1 [43]. Problem (1) can be rewritten in the following equivalent form:

$$\min_{x \in \mathbb{R}^{mn}} \ f(x) = \sum_{i=1}^{m} f_i(x_i) = \sum_{i=1}^{m} \frac{1}{q_i} \sum_{j=1}^{q_i} f_i^j(x_i), \quad \text{s.t. } \mathbf{L} x = 0, \ x \in X \tag{2}$$
where $x = [x_1^T, \ldots, x_m^T]^T$, $\mathbf{L} = L \otimes I_n$, and $X = \prod_{i=1}^{m} X_i$ is the Cartesian product. If $x^* = [x_1^*, \ldots, x_m^*]$ is a global optimization solution to problem (2), then $\mathbf{L} x^* = 0$ implies $x_i^* = x_j^* = \tilde{x}^*$ for all $i, j \in V$. We next make two assumptions needed in the later analysis.

Assumption 1. Each local instantaneous function $f_i^j$ is strongly convex with parameter $\mu$, i.e., for all $i \in V$ and $j \in \{1, \ldots, q_i\}$,

$$\big(\nabla f_i^j(a) - \nabla f_i^j(b)\big)^T (a - b) \ge \mu \|a - b\|^2, \quad \forall a, b \in \mathbb{R}^n \tag{3}$$

It follows from Assumption 1 that $f_i$, $i \in V$, and $f$ are also strongly convex with parameter $\mu$.

Assumption 2. The gradient of each local instantaneous function, $\nabla f_i^j$, is Lipschitz continuous with parameter $L_f$, i.e., for all $i \in V$ and $j \in \{1, \ldots, q_i\}$,

$$\big\|\nabla f_i^j(a) - \nabla f_i^j(b)\big\| \le L_f \|a - b\|, \quad \forall a, b \in \mathbb{R}^n \tag{4}$$

Likewise, from Assumption 2, the gradients of $f_i$, $i \in V$, and $f$ are also Lipschitz continuous with parameter $L_f$. Define the Lagrangian function as

$$L_a(x, \lambda) = f(x) + \lambda^T \mathbf{L} x \tag{5}$$
where $\lambda \in \mathbb{R}^{nm}$ is the Lagrangian multiplier vector. The primal problem associated with problem (2) can then be written as

$$\min_{x \in X} \max_{\lambda \in \mathbb{R}^{nm}} L_a(x, \lambda) \tag{6}$$

and its dual problem is

$$\max_{\lambda \in \mathbb{R}^{nm}} \min_{x \in X} L_a(x, \lambda) \tag{7}$$
Lemma 2 [41]. A point $(x^*, \lambda^*) \in X \times \mathbb{R}^{nm}$ is a saddle point, i.e., a primal-dual solution pair of problems (6) and (7), when the following condition is satisfied:

$$L_a(x^*, \lambda) \le L_a(x^*, \lambda^*) \le L_a(x, \lambda^*) \tag{8}$$

where $x$ and $\lambda$ represent the primal and dual variables, respectively.

Remark 1. According to Lemma 2, when $(x^*, \lambda^*)$ is a saddle point of $L_a(x, \lambda)$, $x^*$ is a primal optimization solution to problem (6), and equivalently a global optimization solution to problem (2). Thus, we can address the distributed optimization problem (2) by resorting to a saddle-point method.
Remark 2. In the later analysis, for the sake of simplicity, we suppose that each node's variable is scalar, i.e., $x_i \in \mathbb{R}$, $i \in V$. The obtained results extend to the high-dimensional case by means of the Kronecker product.

3. Related algorithms

In this section, we introduce an existing primal-dual distributed algorithm for problem (2) and recall unbiased stochastic averaging gradients.

3.1. Primal-dual distributed algorithm

A primal-dual distributed algorithm for problem (2) is proposed by Lei et al. [29] on the basis of the augmented Lagrangian function. Let $x_k^i$ and $\lambda_k^i$ be the local estimates of the primal and dual variables of node i at time k, respectively. In [29], at time k the variables $x_k^i$ and $\lambda_k^i$ are updated as

$$x_{k+1}^i = P_{X_i}\Big[x_k^i - \alpha \nabla f_i(x_k^i) - \alpha \sum_{j=1}^{m} a_{ij}\big(x_k^i - x_k^j\big) - \alpha \sum_{j=1}^{m} a_{ij}\big(\lambda_k^i - \lambda_k^j\big)\Big] \tag{9}$$

$$\lambda_{k+1}^i = \lambda_k^i + \alpha \sum_{j=1}^{m} a_{ij}\big(x_{k+1}^i - x_{k+1}^j\big) \tag{10}$$

where $\alpha$ is a constant step-size and $\nabla f_i(x_k^i)$ is the gradient of $f_i$ at $x_k^i$. In essence, algorithm (9)–(10) can be viewed as a gradient-descent algorithm for finding the primal-dual solution pair (the saddle point) of the augmented Lagrangian function $\tilde{L}_a(x, \lambda) = L_a(x, \lambda) + \frac{1}{2} x^T \mathbf{L} x$. Note that $f_i$ is the average of the $q_i$ local instantaneous functions available at node i. At time k, the update (9) requires each node i to evaluate the true gradient of $f_i$:

$$\nabla f_i(x_k^i) = \frac{1}{q_i} \sum_{j=1}^{q_i} \nabla f_i^j(x_k^i) \tag{11}$$

If $q_i$ is large, implementing Eq. (11) is computationally expensive; this can be avoided by substituting unbiased stochastic averaging gradients for the local objective function gradients.

3.2. Unbiased stochastic averaging gradients

Define $\chi_k^i \in \{1, \ldots, q_i\}$ as the function index of node i, selected uniformly at random at time k. Let the variable $y_{i,k}^j$ be node i's stored copy of its variable for the instantaneous function $f_i^j$ at time k, updated as follows:
$$y_{i,k+1}^j = x_k^i \ \ \text{if } j = \chi_k^i, \qquad y_{i,k+1}^j = y_{i,k}^j \ \ \text{if } j \ne \chi_k^i \tag{12}$$
The unbiased stochastic averaging gradient of node i at time k is defined as

$$g_k^i = \nabla f_i^{\chi_k^i}\big(x_k^i\big) - \nabla f_i^{\chi_k^i}\big(y_{i,k}^{\chi_k^i}\big) + \frac{1}{q_i} \sum_{j=1}^{q_i} \nabla f_i^j\big(y_{i,k}^j\big) \tag{13}$$

It should be pointed out that all local instantaneous function gradients of a node are stored in a table. At each iteration, only the gradient of the one randomly selected local instantaneous function is updated in each node, and all other entries remain unchanged. Note that computing the sum $\sum_{j=1}^{q_i} \nabla f_i^j(y_{i,k}^j)$ in Eq. (13) at time k is also costly. To overcome this deficiency, we can update the sum recursively as follows:

$$\sum_{j=1}^{q_i} \nabla f_i^j\big(y_{i,k}^j\big) = \nabla f_i^{\chi_{k-1}^i}\big(x_{k-1}^i\big) - \nabla f_i^{\chi_{k-1}^i}\big(y_{i,k-1}^{\chi_{k-1}^i}\big) + \sum_{j=1}^{q_i} \nabla f_i^j\big(y_{i,k-1}^j\big) \tag{14}$$
which provides a computationally efficient way by storing its last value. Let Fk measure the history of the system up until time k. Lemma 3 [36]. The stochastic averaging gradient gik is an unbiased estimate of ∇ fi (xki ), i = 1, . . . , m, i.e., E gi Fk = ∇ fi (x i ) (15) k
k
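The gradient table, the averaging step of Eq. (13), and the recursive sum of Eq. (14) can be sketched for a single node as follows. The quadratic instantaneous functions and all numerical values are hypothetical, chosen only so that each gradient is cheap to evaluate:

```python
import numpy as np

rng = np.random.default_rng(0)
q_i = 5
targets = rng.normal(size=q_i)          # hypothetical data: f_i^j(x) = 0.5*(x - t_j)^2
grad = lambda j, x: x - targets[j]      # gradient of one instantaneous function

x = 0.0
grad_table = np.array([grad(j, x) for j in range(q_i)])  # gradients at y_{i,0}^j = x_0
grad_sum = grad_table.sum()                               # running sum, kept per Eq. (14)

for k in range(200):
    j = rng.integers(q_i)                       # chi_k^i: uniformly random function index
    g_new = grad(j, x)
    # Eq. (13): unbiased stochastic averaging gradient
    g_i = g_new - grad_table[j] + grad_sum / q_i
    # Eq. (14): refresh the running sum using only the updated table entry
    grad_sum += g_new - grad_table[j]
    grad_table[j] = g_new                       # table entry now holds grad at y = x_k
    x -= 0.1 * g_i                              # plain (unprojected) gradient step

# x should approach the minimizer of (1/q_i) * sum_j f_i^j, i.e., mean(targets)
```

Because the table entry and the running sum are refreshed with the same quantity, the $O(q_i)$ summation is paid only once, at initialization.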
4. Primal-dual stochastic distributed algorithm

In this section, a primal-dual stochastic distributed algorithm is proposed to address problem (2). Aiming at reducing the computation burden of local gradients, we substitute unbiased stochastic averaging gradients for the local objective function gradients. Based on the discussion above, problem (2) can be solved by finding a saddle point of the augmented Lagrangian function. We develop the primal-dual stochastic distributed algorithm on the basis of algorithm (9)–(10), utilizing the unbiased stochastic averaging gradients (13):

$$x_{k+1}^i = P_{X_i}\Big[x_k^i - \alpha g_k^i - \alpha \sum_{j=1}^{m} a_{ij}\big(x_k^i - x_k^j\big) - \alpha \sum_{j=1}^{m} a_{ij}\big(\lambda_k^i - \lambda_k^j\big)\Big] \tag{16}$$

$$\lambda_{k+1}^i = \lambda_k^i + \alpha \sum_{j=1}^{m} a_{ij}\big(x_{k+1}^i - x_{k+1}^j\big) \tag{17}$$

The pseudo-code implementation of algorithm (16)–(17) is presented in Algorithm 1. Let $x_k = [x_k^1, \ldots, x_k^m]^T \in \mathbb{R}^m$, $\lambda_k = [\lambda_k^1, \ldots, \lambda_k^m]^T \in \mathbb{R}^m$, and $g_k = [g_k^1, \ldots, g_k^m]^T \in \mathbb{R}^m$. Recalling the definition of $\mathbf{L}$, algorithm (16)–(17) can be written compactly as

$$x_{k+1} = P_X\big[x_k - \alpha g_k - \alpha \mathbf{L} x_k - \alpha \mathbf{L} \lambda_k\big] \tag{18}$$

$$\lambda_{k+1} = \lambda_k + \alpha \mathbf{L} x_{k+1} \tag{19}$$
Algorithm 1 Primal-dual stochastic distributed algorithm for constrained convex optimization.
1: Given the initial variables $x_0^i \in X_i$ and $\lambda_0^i = 0$, initialize the tables of local gradients with $\nabla f_i^j(y_{i,0}^j)$, $y_{i,0}^j = x_0^i$, for $j \in \{1, \ldots, q_i\}$, $i = 1, \ldots, m$.
2: for $k = 0, 1, 2, \ldots$ do
3:  Select $\chi_k^i$ uniformly at random from $\{1, \ldots, q_i\}$.
4:  Compute the sum of the local gradients and store it:
5:  if $k = 0$ then
6:   $\sum_{j=1}^{q_i} \nabla f_i^j(y_{i,k}^j) = \sum_{j=1}^{q_i} \nabla f_i^j(y_{i,0}^j)$
7:  else
8:   $\sum_{j=1}^{q_i} \nabla f_i^j(y_{i,k}^j) = \nabla f_i^{\chi_{k-1}^i}(x_{k-1}^i) - \nabla f_i^{\chi_{k-1}^i}(y_{i,k-1}^{\chi_{k-1}^i}) + \sum_{j=1}^{q_i} \nabla f_i^j(y_{i,k-1}^j)$
9:  end if
10: Compute the stochastic averaging gradient according to $g_k^i = \nabla f_i^{\chi_k^i}(x_k^i) - \nabla f_i^{\chi_k^i}(y_{i,k}^{\chi_k^i}) + \frac{1}{q_i}\sum_{j=1}^{q_i} \nabla f_i^j(y_{i,k}^j)$
11: Take $y_{i,k+1}^{\chi_k^i} = x_k^i$ and store $\nabla f_i^{\chi_k^i}(y_{i,k+1}^{\chi_k^i}) = \nabla f_i^{\chi_k^i}(x_k^i)$ in the table of local gradients; all other entries remain unchanged. In fact, the variable $y_{i,k}^j$ is not explicitly stored.
12: Update the primal variables as $x_{k+1}^i = P_{X_i}\big[x_k^i - \alpha g_k^i - \alpha \sum_{j=1}^{m} a_{ij}(x_k^i - x_k^j) - \alpha \sum_{j=1}^{m} a_{ij}(\lambda_k^i - \lambda_k^j)\big]$
13: Update the dual variables as $\lambda_{k+1}^i = \lambda_k^i + \alpha \sum_{j=1}^{m} a_{ij}(x_{k+1}^i - x_{k+1}^j)$
14: end for
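Algorithm 1 can be sketched end to end in a few dozen lines. The sketch below assumes scalar node variables (Remark 2), hypothetical quadratic instantaneous functions $f_i^j(x) = \frac{1}{2}(x - t_{ij})^2$, a ring graph, and a box set wide enough that the constraint stays inactive, so all nodes should agree on the minimizer of the global objective; none of these choices come from the paper's experiments:

```python
import numpy as np

rng = np.random.default_rng(1)
m, q, alpha = 4, 3, 0.05            # nodes, instantaneous functions per node, step-size

# Hypothetical data: f_i^j(x) = 0.5*(x - T[i, j])^2, so grad f_i^j(x) = x - T[i, j].
T = rng.normal(size=(m, q))
lo, hi = -2.0, 2.0                  # box X_i = [lo, hi], wide enough to stay inactive here

# Ring graph with symmetric weights (an arbitrary connected topology).
A = np.zeros((m, m))
for i in range(m):
    A[i, (i + 1) % m] = A[(i + 1) % m, i] = 0.5
L = np.diag(A.sum(axis=1)) - A      # graph Laplacian

x = np.zeros(m)                     # x_0^i in X_i
lam = np.zeros(m)                   # lambda_0^i = 0
table = x[:, None] - T              # gradient table at y_{i,0}^j = x_0^i
sums = table.sum(axis=1)            # running sums, maintained as in Eq. (14)

for k in range(3000):
    g = np.empty(m)
    for i in range(m):
        j = rng.integers(q)                        # chi_k^i
        g_new = x[i] - T[i, j]
        g[i] = g_new - table[i, j] + sums[i] / q   # Eq. (13)
        sums[i] += g_new - table[i, j]             # Eq. (14)
        table[i, j] = g_new
    # Eq. (16): projected primal step; np.clip is the projection onto the box.
    x = np.clip(x - alpha * g - alpha * (L @ x) - alpha * (L @ lam), lo, hi)
    lam = lam + alpha * (L @ x)                    # Eq. (17)
```

Since the box constraint is inactive here, every node should approach the common minimizer $\bar{t} = \frac{1}{mq}\sum_{i,j} t_{ij}$, illustrating both consensus and optimality.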
Denote by $z_{k+1}$ the projection error in Eq. (18), i.e.,

$$z_{k+1} = P_X\big[x_k - \alpha g_k - \alpha \mathbf{L} x_k - \alpha \mathbf{L} \lambda_k\big] - \big(x_k - \alpha g_k - \alpha \mathbf{L} x_k - \alpha \mathbf{L} \lambda_k\big) \tag{20}$$

By Eq. (20) and $\mathbf{L} x^* = 0$, we can derive

$$x_{k+1} - x^* = P_X\big[x_k - \alpha g_k - \alpha \mathbf{L} x_k - \alpha \mathbf{L} \lambda_k\big] - x^* = W\big(x_k - x^*\big) - \alpha \mathbf{L}\big(\lambda_k - \lambda^*\big) - \alpha g_k - \alpha \mathbf{L} \lambda^* + z_{k+1} \tag{21}$$

where $W = I - \alpha \mathbf{L}$ and $(x^*, \lambda^*)$ is a saddle point of $L_a(x, \lambda)$ in Eq. (5). From Eq. (19), we have

$$\lambda_{k+1} - \lambda^* = \lambda_k - \lambda^* + \alpha \mathbf{L} x_{k+1} \tag{22}$$

In addition, we define an auxiliary function as follows:

$$V(x_k, \lambda_k) = \big\|x_k - x^*\big\|_W^2 + \big\|\lambda_k - \lambda^*\big\|^2 \tag{23}$$
In the next section, we give the convergence analysis of algorithm (18)–(19) by applying the Lyapunov function method.

5. Convergence analysis

In this section, we analyze the convergence of the proposed algorithm (18)–(19). Before showing the convergence results, we first provide some supporting lemmas.
5.1. Supporting lemmas

Define an auxiliary variable $p_k$ as follows:

$$p_k = \sum_{i=1}^{m} \frac{1}{q_i} \sum_{j=1}^{q_i} \Big[f_i^j\big(y_{i,k}^j\big) - f_i^j(\tilde{x}^*) - \nabla f_i^j(\tilde{x}^*)^T\big(y_{i,k}^j - \tilde{x}^*\big)\Big] \tag{24}$$

Since all $f_i^j$, $i \in V$, $j = 1, \ldots, q_i$, are strongly convex, each term $f_i^j(y_{i,k}^j) - f_i^j(\tilde{x}^*) - \nabla f_i^j(\tilde{x}^*)^T(y_{i,k}^j - \tilde{x}^*)$ is non-negative, which implies that the variable $p_k$ is also non-negative. In the following convergence analysis, we seek an upper bound, in terms of $p_k$, for the expected squared difference between the unbiased stochastic averaging gradient $g_k$ and the optimal gradient $\nabla f(x^*)$ conditioned on $\mathcal{F}_k$, i.e., $\mathbb{E}[\|g_k - \nabla f(x^*)\|^2 \mid \mathcal{F}_k]$.

Lemma 4 [36]. Suppose that Assumptions 1 and 2 hold. According to the definition of $g_k$ and algorithm (18)–(19), the conditional expectation of the squared difference between $g_k$ and $\nabla f(x^*)$ is bounded by

$$\mathbb{E}\big[\|g_k - \nabla f(x^*)\|^2 \mid \mathcal{F}_k\big] \le 4 L_f p_k + 2\big(2 L_f - \mu\big)\Big(f(x_k) - f(x^*) - \nabla f(x^*)^T\big(x_k - x^*\big)\Big) \tag{25}$$

The following lemma gives an upper bound for $p_{k+1}$ in expectation conditioned on $\mathcal{F}_k$.

Lemma 5 [36]. Suppose that Assumptions 1 and 2 hold. Based on the definition of $p_k$ and algorithm (18)–(19), for all $k > 0$ the variable $p_k$ satisfies

$$\mathbb{E}\big[p_{k+1} \mid \mathcal{F}_k\big] \le \Big(1 - \frac{1}{q_{\max}}\Big) p_k + \frac{1}{q_{\min}}\Big(f(x_k) - f(x^*) - \nabla f(x^*)^T\big(x_k - x^*\big)\Big) \tag{26}$$

where $q_{\min}$ and $q_{\max}$ are the minimum and maximum numbers of local instantaneous functions over the nodes of the network, respectively.

5.2. Main results

Convergence results of algorithm (18)–(19) for problem (2) are presented in this subsection.

Theorem 1. Suppose that Assumptions 1 and 2 hold. Recall the definition of $p_k$ and consider the sequence $(x_k, \lambda_k)$ generated by algorithm (18)–(19). If the parameters $\eta$, $\phi$, and $c$ satisfy

$$\eta > \frac{2 L_f^2 q_{\max}}{q_{\min} \gamma (2\mu - \phi)}, \qquad 0 < \phi < 2\mu, \qquad c > \frac{4 \alpha L_f q_{\max}}{\eta}$$

and the constant step-size $\alpha$ satisfies

$$0 < \alpha < \frac{1}{\lambda_{\max}(\mathbf{L}) + \eta + L_f^2/\phi}$$
where $0 < \gamma < 1$, then we have (i) the Lyapunov function

$$V(x_k, \lambda_k) + \big\|x_k - x^*\big\|_{\gamma Q}^2 + c\, p_k \tag{27}$$

where $Q = 2\alpha \mathbf{L} - \alpha^2 \mathbf{L}^2 - \alpha(\phi - 2\mu) I$, converges to zero almost surely; and (ii) the sequence of all local variables $x_k^i$, $i \in V$, converges to a global optimization solution $\tilde{x}^*$ almost surely.

Proof. See Appendix A.

From Theorem 1, the sequence $(x_k, \lambda_k)$ generated by algorithm (18)–(19) converges to a saddle point $(x^*, \lambda^*)$. This implies that algorithm (18)–(19) solves the original constrained convex optimization problem (1).

Next, we consider the situation where $X_i = \mathbb{R}$, $i \in V$. In this case, problem (1) becomes an unconstrained convex optimization problem, and algorithm (18)–(19) simplifies to

$$x_{k+1} = x_k - \alpha g_k - \alpha \mathbf{L} x_k - \alpha \mathbf{L} \lambda_k \tag{28}$$

$$\lambda_{k+1} = \lambda_k + \alpha \mathbf{L} x_{k+1} \tag{29}$$

Suppose that $(x^*, \lambda^*)$ represents a saddle point of $L_a(x, \lambda)$ in $X \times \mathbb{R}^m$. We will show that the sequence of the global variable $x_k$ generated by algorithm (28)–(29) converges to a global optimization solution $x^*$ of problem (1) at a linear convergence rate.

Theorem 2. Suppose that Assumptions 1 and 2 hold. Recall the definition of $p_k$ and consider the sequence $(x_k, \lambda_k)$ generated by algorithm (28)–(29). If the parameters $\eta$, $\phi$, and $c$ satisfy

$$\eta > \frac{2 L_f^2 q_{\max}}{q_{\min} \gamma (2\mu - \phi)}, \qquad 0 < \phi < 2\mu, \qquad \frac{4 \alpha L_f q_{\max}}{\eta} < c < \frac{2 q_{\min} \gamma \alpha (2\mu - \phi)}{L_f}$$

and the constant step-size $\alpha$ satisfies

$$0 < \alpha < \frac{1}{\lambda_{\max}(\mathbf{L}) + \eta + L_f^2/\phi}$$

where $0 < \gamma < 1$, then there exists a constant $\delta > 0$ such that

$$(\delta + 1)\,\mathbb{E}\Big[\big\|x_{k+1} - x^*\big\|_W^2 + \big\|\lambda_{k+1} - \lambda^*\big\|^2 + \big\|x_{k+1} - x^*\big\|_{\gamma Q}^2 + c\, p_{k+1} \,\Big|\, \mathcal{F}_k\Big] \le \big\|x_k - x^*\big\|_W^2 + \big\|\lambda_k - \lambda^*\big\|^2 + \big\|x_k - x^*\big\|_{\gamma Q}^2 + c\, p_k \tag{30}$$
Proof. See Appendix B.

Theorem 2 shows that the conditional expectation at time k of the Lyapunov function (27) evaluated at time k+1 is smaller than its value at time k. Thus, by Eq. (30), we have

$$\mathbb{E}\Big[\big\|x_k - x^*\big\|_W^2 + \big\|\lambda_k - \lambda^*\big\|^2 + \big\|x_k - x^*\big\|_{\gamma Q}^2 + c\, p_k\Big] \le (1 + \delta)^{-k}\Big(\big\|x_0 - x^*\big\|_W^2 + \big\|\lambda_0 - \lambda^*\big\|^2 + \big\|x_0 - x^*\big\|_{\gamma Q}^2 + c\, p_0\Big) \tag{31}$$

which implies that, in expectation, the sequence $\|x_k - x^*\|_W^2 + \|\lambda_k - \lambda^*\|^2 + \|x_k - x^*\|_{\gamma Q}^2 + c\, p_k$ converges to zero at a linear rate $O((1 + \delta)^{-k})$. Observing that

$$\lambda_{\min}(W)\big\|x_k - x^*\big\|^2 \le \big\|x_k - x^*\big\|_W^2 \le \big\|x_k - x^*\big\|_W^2 + \big\|\lambda_k - \lambda^*\big\|^2 + \big\|x_k - x^*\big\|_{\gamma Q}^2 + c\, p_k \tag{32}$$

we can derive

$$\mathbb{E}\Big[\big\|x_k - x^*\big\|^2\Big] \le \frac{(1 + \delta)^{-k}}{\lambda_{\min}(W)}\Big(\big\|x_0 - x^*\big\|_W^2 + \big\|\lambda_0 - \lambda^*\big\|^2 + \big\|x_0 - x^*\big\|_{\gamma Q}^2 + c\, p_0\Big) \tag{33}$$

Remark 3. From Eq. (33), the sequence $\mathbb{E}[\|x_k - x^*\|^2]$ also converges to zero at a linear rate $O((1 + \delta)^{-k})$. Consequently, the sequence $x_k$ generated by algorithm (28)–(29) converges almost surely to a global optimization solution of problem (2) when $X_i = \mathbb{R}$, $i \in V$, which implies that all variables $x_k^i$, $i \in V$, simultaneously converge to the global optimization solution $\tilde{x}^*$.

Remark 4. In [29], the primal-dual algorithm is proved to converge at rate $O(1/k)$ for the unconstrained problem. As shown in Theorem 2, we prove that it can be accelerated to a linear rate with stochastic gradients under the strong convexity assumption on the instantaneous functions. Compared with the exact first-order algorithm [25] and its stochastic version [36], we further consider the constrained case and provide a concise convergence analysis based on the Lyapunov function method.

6. Numerical simulations

In this section, the correctness of the theoretical results is verified through a distributed logistic regression problem. The simulations are conducted in an undirected connected network of m nodes with edge probability $p_c$. Motivated by Mokhtari and Ribeiro [36], the local instantaneous function $f_i^j : \mathbb{R}^n \to \mathbb{R}$ at node i, $i = 1, \ldots, m$, $j \in \{1, \ldots, q_i\}$, is given by

$$f_i^j(\tilde{x}) = \frac{\lambda}{2m}\|\tilde{x}\|^2 + q_i \log\Big(1 + \exp\big(-b_{ij} s_{ij}^T \tilde{x}\big)\Big) \tag{34}$$
where $s_{ij} \in \mathbb{R}^n$ is the jth training sample at node i with associated label $b_{ij} \in \{\pm 1\}$, and the regularization term $\frac{\lambda}{2m}\|\tilde{x}\|^2$ is used to reduce overfitting to the training set. Besides, $f_i$ is again the average of the $q_i$ local instantaneous functions $f_i^j$, $j \in \{1, \ldots, q_i\}$, i.e., $f_i(\tilde{x}) = \frac{1}{q_i}\sum_{j=1}^{q_i} f_i^j(\tilde{x})$. The m nodes cooperatively address the following distributed optimization
Fig. 1. The estimates of all nodes' $x_k^i$.
problem:

\tilde{x}^* = \arg\min_{\tilde{x} \in \tilde{X}} \sum_{i=1}^{m} f_i(\tilde{x})   (35)

where the constrained set \tilde{X} is formed by \tilde{X} = \bigcap_{i=1}^{m} X_i.
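To make Eqs. (34) and (35) concrete, the following is a minimal sketch of one local instantaneous function and its gradient (the function and variable names are illustrative, not the authors' code):

```python
import numpy as np

def f_ij(x, s, b, lam, m, q):
    """Local instantaneous function of Eq. (34) at one sample (s, b)."""
    return lam / (2 * m) * x @ x + q * np.log1p(np.exp(-b * (s @ x)))

def grad_f_ij(x, s, b, lam, m, q):
    """Gradient of f_ij: regularizer term plus logistic-loss term."""
    # sigma = exp(-b s^T x) / (1 + exp(-b s^T x))
    sigma = 1.0 / (1.0 + np.exp(b * (s @ x)))
    return lam / m * x - q * b * sigma * s
```

At each iteration the algorithm evaluates grad_f_ij for a single randomly selected index j, rather than the full local gradient.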
Example 1. In this example, we set m = 5, p_c = 0.5, q_i = 20 for all i = 1, \ldots, 5, and n = 2. The entries of the adjacency matrix A are generated by the Metropolis weights [42]. The feature vectors s_{ij}, i = 1, \ldots, 5, j \in \{1, \ldots, q_i\}, follow a Gaussian distribution with mean zero and variance 1.5. The labels b_{ij}, i = 1, \ldots, 5, j \in \{1, \ldots, q_i\}, are randomly generated from a standard Bernoulli distribution. The constrained set \tilde{X} is \tilde{X} = \bigcap_{i=1}^{5} X_i, where X_i = X_i^1 \times X_i^2. In detail, the private constrained sets for the nodes are: X_1^1 = [0.3, 1.2], X_1^2 = [-0.6, 1.5], X_2^1 = [0.1, 1.0], X_2^2 = [-0.4, 1.0], X_3^1 = [0.5, 1.8], X_3^2 = [-0.5, 1.5], X_4^1 = [0.4, 2.2], X_4^2 = [-0.35, 1.0], X_5^1 = [0.3, 2.0], and X_5^2 = [-0.2, 1.3]. Let x_i = [x_i^1, x_i^2]. Denote x_{k,1}^i and x_{k,2}^i as the estimates of x_i^1 and x_i^2 by node i at time k, respectively. We can see from Fig. 1 that the estimate of each node's variable eventually converges to the global optimization solution.
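The Metropolis weights of [42] and the Euclidean projection onto the private box constraints can be sketched as follows (the edge set in the test is illustrative, not the randomly generated network of the experiment):

```python
import numpy as np

def metropolis_weights(edges, m):
    """Symmetric weights a_ij = 1/(1 + max(d_i, d_j)) for {i, j} in E."""
    deg = np.zeros(m, dtype=int)
    for i, j in edges:
        deg[i] += 1; deg[j] += 1
    A = np.zeros((m, m))
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0 / (1 + max(deg[i], deg[j]))
    # self-weights make every row sum to one (doubly stochastic by symmetry)
    A[np.diag_indices(m)] = 1.0 - A.sum(axis=1)
    return A

def project_box(x, lo, hi):
    """Euclidean projection of x onto the box [lo, hi] (component-wise clip)."""
    return np.clip(x, lo, hi)
```

Since each X_i here is a box, the projection step of the algorithm reduces to a component-wise clip, which costs O(n) per node.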
Fig. 2. Comparison across different algorithms. The residual \log_{10} \sum_{i=1}^{m} \|x_i - \tilde{x}^*\| is shown with respect to the number of iterations and the number of gradient evaluations in Fig. 2(a) and (b), respectively. Fig. 2(c) presents the results with respect to real time in simulation.
Example 2. In this example, we set m = 10 and p_c = 0.3. The private constrained set for each node is X_i = R^n, i = 1, \ldots, 10. Set q_i = 20 for all i = 1, \ldots, 10, and n = 2. A Gaussian distribution with zero mean and variance 2 is utilized to generate the feature vectors s_{ij}, i = 1, \ldots, 10, j \in \{1, \ldots, q_i\}. The labels b_{ij}, i = 1, \ldots, 10, j \in \{1, \ldots, q_i\}, are generated from a standard Bernoulli distribution. Denote the global optimization solution by \tilde{x}^*. Simulation results are provided in Fig. 2 by comparing typical distributed algorithms, including the DGD algorithm [15], DSA [36], EXTRA [25], and the proposed algorithm. The DGD algorithm is carried out separately for two cases: constant step-size and diminishing step-size. The step-size of each algorithm is hand-tuned for best performance. The simulation results in Fig. 2 show that with a diminishing step-size the DGD algorithm converges to the global optimization solution but at a slower rate than the other algorithms, whereas DSA, EXTRA, and the proposed algorithm converge at a linear rate. From Fig. 2(a), it is known that EXTRA converges faster than DSA and the proposed algorithm with respect to the number of iterations. But, as shown in Fig. 2(b), EXTRA suffers from the high computation cost of evaluating full local gradients, which is further illustrated by the convergence results with respect to real time in Fig. 2(c). In addition, from Fig. 2 we observe that for the constrained problems the proposed algorithm performs similarly to DSA in convergence rate and number of gradient evaluations, and incurs a slightly higher time cost than DSA because of the evaluation of the dual variables. However, the computation speed of the proposed algorithm can be further improved if, during the implementation of algorithms (9) and (10), we reuse the value \sum_{j=1}^{m} a_{ij}(x_{k+1}^j - x_{k+1}^i) calculated in the dual update at time k to update the primal variables at time k+1.
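Both DSA and the proposed algorithm avoid EXTRA's full-gradient cost by using stochastic averaging gradients. A per-node sketch of this SAGA-style bookkeeping is given below (class and variable names are illustrative, not the authors' code):

```python
import numpy as np

class StochasticAveragingGradient:
    """Keeps the most recent gradient of each local instantaneous function.

    At each call, one index j is sampled uniformly, only grad_fn(x, j) is
    evaluated, and the returned direction is unbiased for the true local
    gradient: g = grad_j(x) - table[j] + mean(table).
    """

    def __init__(self, grad_fn, q, dim, rng):
        self.grad_fn, self.q, self.rng = grad_fn, q, rng
        self.table = np.zeros((q, dim))  # last stored gradient per sample

    def sample(self, x):
        j = self.rng.integers(self.q)
        g_new = self.grad_fn(x, j)
        g = g_new - self.table[j] + self.table.mean(axis=0)
        self.table[j] = g_new  # refresh the stored gradient for index j
        return g
```

Averaging g over the uniform choice of j cancels the stored-gradient terms, so E[g | F_k] equals the exact local gradient, matching the unbiasedness used throughout the analysis.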
7. Conclusion

In this paper, we investigated the distributed convex optimization problem defined over an undirected and connected network, where each node's variable was constrained to a private closed convex set, and each local objective function was further set as the average of a number of local instantaneous functions. Based on unbiased stochastic averaging gradients, a primal-dual stochastic distributed algorithm that significantly reduces the cost of gradient evaluations was proposed. The proposed algorithm converges almost surely to a global optimization solution in the constrained case and provides a linear convergence rate in the unconstrained case. The correctness of the theoretical results was further demonstrated by numerical simulations. Future work will study distributed convex optimization problems over directed and time-varying networks.

Acknowledgment

The work described in this paper was supported in part by the Innovation Support Program for Chongqing Overseas Returnees under Grant no. cx2017043, in part by the Special Financial Support from Chongqing Postdoctoral Science Foundation under Grant no. Xm2017100, in part by the National Natural Science Foundation of China under Grant nos. 61773321, 61673080 and 61762020, in part by the High-level Innovative Talents Project of Guizhou under Grant no. QRLF201621, in part by the Science and Technology Top-notch Talents Support Project of Colleges and Universities in Guizhou under Grant no. QJHKY2016065, in part by the Science and Technology Foundation of Guizhou under Grant nos. QKHJC20161076 and QKHJC20181083, and in part by the National Statistical Science Research Project of China under Grant no. 2018LY66.
Appendix A. Proof of Theorem 1

In this section, to prove that the Lyapunov function defined in Eq. (27) converges to zero, it suffices to show that the expected value of the function (27) at time k+1 with respect to F_k is smaller than its value at time k. Concretely, we need an upper bound for the expected difference, with respect to F_k, between the value of function (27) at time k+1 and at time k, together with a sufficient condition guaranteeing that this difference is negative. To this end, we first seek an upper bound for the difference between V(x_{k+1}, \lambda_{k+1}) and V(x_k, \lambda_k). By Eq. (23), we have

V(x_{k+1}, \lambda_{k+1}) - V(x_k, \lambda_k) = -\|x_{k+1} - x_k\|_W^2 - \|\lambda_{k+1} - \lambda_k\|^2 + 2(x_{k+1} - x^*)^T W (x_{k+1} - x_k) + 2(\lambda_{k+1} - \lambda^*)^T (\lambda_{k+1} - \lambda_k)   (36)
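Eq. (36) is the algebraic identity \|u_{k+1}\|_W^2 - \|u_k\|_W^2 = -\|u_{k+1} - u_k\|_W^2 + 2 u_{k+1}^T W (u_{k+1} - u_k), applied to u_k = x_k - x^* and (with W = I) to \lambda_k - \lambda^*. A quick numeric spot-check with random data (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(4, 4))
W = M @ M.T  # a symmetric positive semidefinite weight matrix
u_new, u_old = rng.normal(size=4), rng.normal(size=4)
d = u_new - u_old

# ||u_new||_W^2 - ||u_old||_W^2 = -||d||_W^2 + 2 u_new^T W d
lhs = u_new @ W @ u_new - u_old @ W @ u_old
rhs = -(d @ W @ d) + 2 * u_new @ W @ d
assert np.isclose(lhs, rhs)
```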
We proceed by finding an upper bound for the last two terms of Eq. (36). By Eqs. (21) and (22), we can rewrite W(x_{k+1} - x_k) as

W(x_{k+1} - x_k) = -\alpha L (x_{k+1} - x^*) - \alpha L (\lambda_{k+1} - \lambda^*) + \alpha^2 L^2 (x_{k+1} - x^*) - \alpha g_k - \alpha L \lambda^* + z_{k+1}   (37)

From Eq. (19), we have \lambda_{k+1} - \lambda_k = \alpha L x_{k+1}. Based on this observation and (37), the last two terms of Eq. (36) can be derived as

2(x_{k+1} - x^*)^T W (x_{k+1} - x_k) + 2(\lambda_{k+1} - \lambda^*)^T (\lambda_{k+1} - \lambda_k) = -\|x_{k+1} - x^*\|^2_{2(\alpha L - \alpha^2 L^2)} - 2\alpha (x_{k+1} - x^*)^T (g_k - \nabla f(x^*)) - 2\alpha (x_{k+1} - x^*)^T (\nabla f(x^*) + L\lambda^*) + 2(x_{k+1} - x^*)^T z_{k+1}   (38)

Since L_a(x, \lambda) is convex in X for each \lambda, we have

(x_{k+1} - x^*)^T (\nabla f(x^*) + L\lambda^*) \ge 0   (39)

It follows from [29] that

(x_{k+1} - x^*)^T z_{k+1} \le 0   (40)
By Eqs. (39) and (40), we obtain an upper bound for the left side of Eq. (38) as

2(x_{k+1} - x^*)^T W (x_{k+1} - x_k) + 2(\lambda_{k+1} - \lambda^*)^T (\lambda_{k+1} - \lambda_k) \le -\|x_{k+1} - x^*\|^2_{2(\alpha L - \alpha^2 L^2)} - 2\alpha (x_{k+1} - x^*)^T (g_k - \nabla f(x^*))   (41)

Note that the last term on the right side of Eq. (41) can be rewritten as

(x_{k+1} - x^*)^T (g_k - \nabla f(x^*)) = (x_{k+1} - x_k)^T (g_k - \nabla f(x_k)) + (x_k - x^*)^T (g_k - \nabla f(x_k)) + (x_{k+1} - x^*)^T (\nabla f(x_k) - \nabla f(x_{k+1})) + (x_{k+1} - x^*)^T (\nabla f(x_{k+1}) - \nabla f(x^*))   (42)

Applying the inequality 2a^T b \ge -\eta \|a\|^2 - (1/\eta)\|b\|^2, \eta > 0, to the vectors a = x_{k+1} - x_k and b = g_k - \nabla f(x_k), the first term on the right side of Eq. (42) satisfies

(x_{k+1} - x_k)^T (g_k - \nabla f(x_k)) \ge -\frac{\eta}{2}\|x_{k+1} - x_k\|^2 - \frac{1}{2\eta}\|g_k - \nabla f(x_k)\|^2   (43)
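The bound 2a^T b \ge -\eta\|a\|^2 - (1/\eta)\|b\|^2 used for Eq. (43) is Young's inequality, since \eta\|a\|^2 + (1/\eta)\|b\|^2 + 2a^T b = \|\sqrt{\eta}\, a + b/\sqrt{\eta}\|^2 \ge 0. A numeric spot-check:

```python
import numpy as np

rng = np.random.default_rng(2)
for _ in range(100):
    a, b = rng.normal(size=5), rng.normal(size=5)
    for eta in (0.1, 1.0, 10.0):
        # 2 a^T b >= -eta ||a||^2 - ||b||^2 / eta  (small slack for rounding)
        assert 2 * a @ b >= -eta * (a @ a) - (b @ b) / eta - 1e-12
```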
Similarly, for the third term on the right side of Eq. (42), it follows that

(x_{k+1} - x^*)^T (\nabla f(x_k) - \nabla f(x_{k+1})) \ge -\frac{\phi}{2}\|x_{k+1} - x^*\|^2 - \frac{1}{2\phi}\|\nabla f(x_k) - \nabla f(x_{k+1})\|^2   (44)

where the constant \phi > 0. On the basis of the Lipschitz continuity of the gradient of f, the second term on the right side of Eq. (44) satisfies

-\frac{1}{2\phi}\|\nabla f(x_k) - \nabla f(x_{k+1})\|^2 \ge -\frac{L_f^2}{2\phi}\|x_k - x_{k+1}\|^2   (45)
Since f is strongly convex with parameter \mu, the last term on the right side of Eq. (42) satisfies

(x_{k+1} - x^*)^T (\nabla f(x_{k+1}) - \nabla f(x^*)) \ge \mu \|x_{k+1} - x^*\|^2   (46)

By Eqs. (42)-(46), one can obtain

-2\alpha (x_{k+1} - x^*)^T (g_k - \nabla f(x^*)) \le \alpha\eta \|x_{k+1} - x_k\|^2 + \frac{\alpha}{\eta}\|g_k - \nabla f(x_k)\|^2 - 2\alpha (x_k - x^*)^T (g_k - \nabla f(x_k)) + \frac{\alpha L_f^2}{\phi}\|x_{k+1} - x_k\|^2 + \alpha\phi \|x_{k+1} - x^*\|^2 - 2\alpha\mu \|x_{k+1} - x^*\|^2   (47)

Combining Eqs. (41) and (47), we obtain an upper bound for the left side of Eq. (36):

\|x_{k+1} - x^*\|_W^2 + \|\lambda_{k+1} - \lambda^*\|^2 - \|x_k - x^*\|_W^2 - \|\lambda_k - \lambda^*\|^2 \le -\|x_{k+1} - x_k\|_W^2 - \|\lambda_{k+1} - \lambda_k\|^2 - \|x_{k+1} - x^*\|^2_{2(\alpha L - \alpha^2 L^2)} + \alpha\left(\eta + \frac{L_f^2}{\phi}\right)\|x_{k+1} - x_k\|^2 + \alpha(\phi - 2\mu)\|x_{k+1} - x^*\|^2 + \frac{\alpha}{\eta}\|g_k - \nabla f(x_k)\|^2 - 2\alpha (x_k - x^*)^T (g_k - \nabla f(x_k))   (48)
Then, we take the expectation with respect to F_k on both sides of Eq. (48). By Eq. (15), it is known that E[g_k - \nabla f(x_k) | F_k] = 0. Besides, according to the definition of F_k, the quantity \|x_k - x^*\|_W^2 + \|\lambda_k - \lambda^*\|^2 is deterministic given F_k. Combining these observations, we have

E[\|x_{k+1} - x^*\|_W^2 | F_k] + E[\|\lambda_{k+1} - \lambda^*\|^2 | F_k] - \|x_k - x^*\|_W^2 - \|\lambda_k - \lambda^*\|^2 \le -E[\|x_{k+1} - x_k\|_W^2 | F_k] - E[\|\lambda_{k+1} - \lambda_k\|^2 | F_k] - E[\|x_{k+1} - x^*\|_Q^2 | F_k] + E[\|x_{k+1} - x_k\|_P^2 | F_k] + \frac{\alpha}{\eta} E[\|g_k - \nabla f(x_k)\|^2 | F_k]   (49)

where P = \alpha(\eta + L_f^2/\phi) I. Next, we proceed by finding an upper bound for the last term on the right side of Eq. (49). Note that it can be expanded as

\frac{\alpha}{\eta} E[\|g_k - \nabla f(x_k)\|^2 | F_k] = \frac{\alpha}{\eta} E[\|g_k - \nabla f(x^*) - (\nabla f(x_k) - \nabla f(x^*))\|^2 | F_k]   (50)
and E[g_k - \nabla f(x^*) | F_k] is equal to \nabla f(x_k) - \nabla f(x^*). Based on the standard variance decomposition E[\|a - E[a]\|^2] = E[\|a\|^2] - \|E[a]\|^2, we derive

\frac{\alpha}{\eta} E[\|g_k - \nabla f(x_k)\|^2 | F_k] = \frac{\alpha}{\eta} E[\|g_k - \nabla f(x^*)\|^2 | F_k] - \frac{\alpha}{\eta}\|\nabla f(x_k) - \nabla f(x^*)\|^2   (51)

Observe that if a function g is strongly convex with parameter \mu, then for two vectors x, y \in R^m we have \|\nabla g(x) - \nabla g(y)\|^2 \ge 2\mu (g(x) - g(y) - (\nabla g(y))^T (x - y)). By the strong convexity of f, the last term in Eq. (51) satisfies

-\frac{\alpha}{\eta}\|\nabla f(x_k) - \nabla f(x^*)\|^2 \le -\frac{2\alpha\mu}{\eta}\left(f(x_k) - f(x^*) - (\nabla f(x^*))^T (x_k - x^*)\right)   (52)
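The variance decomposition E[\|a - E[a]\|^2] = E[\|a\|^2] - \|E[a]\|^2 used for Eq. (51) can be verified on any discrete distribution, e.g. a uniform choice over finitely many vectors:

```python
import numpy as np

rng = np.random.default_rng(3)
samples = rng.normal(size=(20, 4))  # a takes each row with probability 1/20
mean = samples.mean(axis=0)

# E||a - E a||^2  versus  E||a||^2 - ||E a||^2
lhs = np.mean([np.sum((s - mean) ** 2) for s in samples])
rhs = np.mean([np.sum(s ** 2) for s in samples]) - np.sum(mean ** 2)
assert np.isclose(lhs, rhs)
```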
From Lemma 4, one can obtain

\frac{\alpha}{\eta} E[\|g_k - \nabla f(x^*)\|^2 | F_k] \le \frac{4\alpha}{\eta} L_f p_k + \frac{2\alpha}{\eta}(2L_f - \mu)\left(f(x_k) - f(x^*) - (\nabla f(x^*))^T (x_k - x^*)\right)   (53)

Substituting Eqs. (52) and (53) into Eq. (51) yields

\frac{\alpha}{\eta} E[\|g_k - \nabla f(x_k)\|^2 | F_k] \le \frac{4\alpha}{\eta} L_f p_k + \frac{4\alpha}{\eta}(L_f - \mu)\left(f(x_k) - f(x^*) - (\nabla f(x^*))^T (x_k - x^*)\right)   (54)

On both sides of Eq. (49), add and subtract an auxiliary term E[\|x_{k+1} - x^*\|^2_{\gamma Q} | F_k]. Then, substituting Eq. (54) into Eq. (49) and adding c(E[p_{k+1} | F_k] - p_k), c > 0, to both sides, it follows that

E[\|x_{k+1} - x^*\|_W^2 | F_k] + E[\|\lambda_{k+1} - \lambda^*\|^2 | F_k] + E[\|x_{k+1} - x^*\|^2_{\gamma Q} | F_k] + c E[p_{k+1} | F_k] - \|x_k - x^*\|_W^2 - \|\lambda_k - \lambda^*\|^2 - \|x_k - x^*\|^2_{\gamma Q} - c p_k \le -E[\|x_{k+1} - x_k\|_W^2 | F_k] - E[\|\lambda_{k+1} - \lambda_k\|^2 | F_k] - E[\|x_{k+1} - x^*\|_Q^2 | F_k] + E[\|x_{k+1} - x^*\|^2_{\gamma Q} | F_k] + E[\|x_{k+1} - x_k\|_P^2 | F_k] + \frac{4\alpha}{\eta} L_f p_k + \frac{4\alpha}{\eta}(L_f - \mu)\left(f(x_k) - f(x^*) - (\nabla f(x^*))^T (x_k - x^*)\right) - \|x_k - x^*\|^2_{\gamma Q} + c\left(E[p_{k+1} | F_k] - p_k\right)   (55)

From Lemma 5, we have

c\left(E[p_{k+1} | F_k] - p_k\right) \le -\frac{c}{q_{\max}} p_k + \frac{c}{q_{\min}}\left(f(x_k) - f(x^*) - (\nabla f(x^*))^T (x_k - x^*)\right)   (56)
In order to ensure that the sequence (x_k, \lambda_k) generated by algorithms (18) and (19) converges to the global optimization solution, the left side of inequality (55) should be less than zero. Combining Eqs. (55) and (56), and bounding f(x_k) - f(x^*) - (\nabla f(x^*))^T (x_k - x^*) \le (L_f/2)\|x_k - x^*\|^2 by the Lipschitz continuity of \nabla f, yields a sufficient condition for this as

0 < E[\|x_{k+1} - x_k\|_W^2 | F_k] + E[\|\lambda_{k+1} - \lambda_k\|^2 | F_k] + E[\|x_{k+1} - x^*\|_Q^2 | F_k] - E[\|x_{k+1} - x^*\|^2_{\gamma Q} | F_k] - E[\|x_{k+1} - x_k\|_P^2 | F_k] + \|x_k - x^*\|^2_{\gamma Q} + \left(\frac{c}{q_{\max}} - \frac{4\alpha}{\eta} L_f\right) p_k - \left(\frac{4\alpha}{\eta}(L_f - \mu) + \frac{c}{q_{\min}}\right)\frac{L_f}{2}\|x_k - x^*\|^2   (57)

Inequality (57) holds if

W - P > 0,
(1 - \gamma) Q > 0,
\frac{c}{q_{\max}} > \frac{4\alpha}{\eta} L_f,
\gamma \lambda_{\min}(Q) - \left(\frac{4\alpha}{\eta}(L_f - \mu) + \frac{c}{q_{\min}}\right)\frac{L_f}{2} > 0   (58)
The conditions in Eq. (58) are satisfied when the parameters \alpha, \gamma, \phi, \eta and c satisfy

0 < \alpha < \frac{1}{\lambda_{\max}(L) + \eta + L_f^2/\phi},
0 < \phi < 2\mu,
\eta > \frac{2 L_f^2 q_{\max}}{q_{\min} \gamma (2\mu - \phi)},
c > \frac{4\alpha L_f q_{\max}}{\eta}   (59)
where 0 < \gamma < 1.

Therefore, when the conditions in Eq. (59) are satisfied, the sequence \|x_k - x^*\|_W^2 + \|\lambda_k - \lambda^*\|^2 + \|x_k - x^*\|^2_{\gamma Q} + c p_k converges to zero almost surely. As a direct consequence, the sequences \|x_k - x^*\|_W^2 and \|\lambda_k - \lambda^*\|^2 also converge to zero almost surely. By the definition of W, we have \lambda_{\min}(W)\|x_k - x^*\|^2 \le \|x_k - x^*\|_W^2. Based on this observation, it is known that the sequence \|x_k - x^*\|^2 almost surely converges to zero. Consequently, the sequence x_k almost surely converges to the global optimization solution x^* to problem (2), and each local variable x_i, i \in V, converges to the global optimization solution \tilde{x}^*. This completes the proof.

Appendix B. Proof of Theorem 2

In this section, the main idea is to find an upper bound for the expectation, with respect to F_k, of \delta times the value of the Lyapunov function defined in Eq. (27) at time k+1. When X_i = R^n, i \in V, inequality (55) can also be derived. Combining this upper bound with inequality (55), a sufficient condition is required to make sure inequality (30) holds. Firstly, by Eq. (56), inequality (55) can be transformed into

E[\|x_{k+1} - x_k\|_W^2 | F_k] - E[\|x_{k+1} - x_k\|_P^2 | F_k] + E[\|\lambda_{k+1} - \lambda_k\|^2 | F_k] + E[\|x_{k+1} - x^*\|_Q^2 | F_k] - E[\|x_{k+1} - x^*\|^2_{\gamma Q} | F_k] + \|x_k - x^*\|^2_{\gamma Q} - \left(\frac{4\alpha}{\eta} L_f - \frac{c}{q_{\max}}\right) p_k - \left(\frac{4\alpha}{\eta}(L_f - \mu) + \frac{c}{q_{\min}}\right)\left(f(x_k) - f(x^*) - (\nabla f(x^*))^T (x_k - x^*)\right) \le \|x_k - x^*\|_W^2 + \|\lambda_k - \lambda^*\|^2 + \|x_k - x^*\|^2_{\gamma Q} + c p_k - \left(E[\|x_{k+1} - x^*\|_W^2 | F_k] + E[\|\lambda_{k+1} - \lambda^*\|^2 | F_k] + E[\|x_{k+1} - x^*\|^2_{\gamma Q} | F_k] + c E[p_{k+1} | F_k]\right)   (60)

Let A be the left side of Eq. (60). To prove Eq. (30), it is equivalent to showing

\delta\left(E[\|x_{k+1} - x^*\|_W^2 | F_k] + E[\|\lambda_{k+1} - \lambda^*\|^2 | F_k] + E[\|x_{k+1} - x^*\|^2_{\gamma Q} | F_k] + c E[p_{k+1} | F_k]\right) \le \|x_k - x^*\|_W^2 + \|\lambda_k - \lambda^*\|^2 + \|x_k - x^*\|^2_{\gamma Q} + c p_k - \left(E[\|x_{k+1} - x^*\|_W^2 | F_k] + E[\|\lambda_{k+1} - \lambda^*\|^2 | F_k] + E[\|x_{k+1} - x^*\|^2_{\gamma Q} | F_k] + c E[p_{k+1} | F_k]\right)   (61)
Let B be the left side of Eq. (61). In order to make inequality (61) valid, an upper bound for B is required. Firstly, we proceed by finding a non-negative upper bound for the expectation of \|\lambda_{k+1} - \lambda^*\|^2 with respect to F_k in B. In the unconstrained case, the global optimization solution x^* satisfies

\nabla f(x^*) + L \lambda^* = 0   (62)

Combining Eqs. (28), (29) and (62), we have

x_{k+1} - x_k + \alpha\left(g_k - \nabla f(x^*) + L(\lambda_{k+1} - \alpha L x_{k+1}) - L\lambda^* + L x_k\right) = 0   (63)

which is equivalent to

\|(I - \alpha^2 L^2)(x_{k+1} - x^*) - (I - \alpha L)(x_k - x^*)\|^2 = \|\alpha(g_k - \nabla f(x^*)) + \alpha L(\lambda_{k+1} - \lambda^*)\|^2   (64)

Then, applying \|a + b\|^2 \ge (1 - \theta)\|a\|^2 + (1 - \frac{1}{\theta})\|b\|^2, \theta > 1, to the vectors a = \alpha(g_k - \nabla f(x^*)) and b = \alpha L(\lambda_{k+1} - \lambda^*), it follows that

\|(I - \alpha^2 L^2)(x_{k+1} - x^*) - (I - \alpha L)(x_k - x^*)\|^2 \ge -\alpha^2(\theta - 1)\|g_k - \nabla f(x^*)\|^2 + \alpha^2\left(1 - \frac{1}{\theta}\right)\|L(\lambda_{k+1} - \lambda^*)\|^2   (65)

By the definition of L, both L and L^2 have m - 1 positive eigenvalues and one zero eigenvalue. We write the eigenvalues of L and L^2 in non-decreasing order as 0 = \lambda_1(L) < \lambda_2(L) \le \cdots \le \lambda_m(L) and 0 = \lambda_1(L^2) < \lambda_2(L^2) \le \cdots \le \lambda_m(L^2), respectively. Note that the second term on the right side of Eq. (65) can be bounded as

\|L(\lambda_{k+1} - \lambda^*)\|^2 \ge \lambda_2(L^2)\|\lambda_{k+1} - \lambda^*\|^2   (66)

By Eqs. (25), (65) and (66), the expectation of \|\lambda_{k+1} - \lambda^*\|^2 with respect to F_k is bounded by:
E[\|\lambda_{k+1} - \lambda^*\|^2 | F_k] \le \frac{1}{\alpha^2(1 - \frac{1}{\theta})\lambda_2(L^2)} E[\|(I - \alpha^2 L^2)(x_{k+1} - x^*) - (I - \alpha L)(x_k - x^*)\|^2 | F_k] + \frac{4(\theta - 1)L_f}{(1 - \frac{1}{\theta})\lambda_2(L^2)} p_k + \frac{2(\theta - 1)(2L_f - \mu)}{(1 - \frac{1}{\theta})\lambda_2(L^2)}\left(f(x_k) - f(x^*) - (\nabla f(x^*))^T (x_k - x^*)\right)   (67)

For the first term on the right side of Eq. (67), we can derive

E[\|(I - \alpha^2 L^2)(x_{k+1} - x_k + x_k - x^*) - (I - \alpha L)(x_k - x^*)\|^2 | F_k] \le 2 E[\|x_{k+1} - x_k\|^2_{(I - \alpha^2 L^2)^T (I - \alpha^2 L^2)} | F_k] + 2\|x_k - x^*\|^2_{(-\alpha^2 L^2 + \alpha L)^T (-\alpha^2 L^2 + \alpha L)}   (68)

Thus, by Eqs. (67) and (68) we obtain an upper bound for B as follows:

B \le E[\|x_{k+1} - x^*\|^2_{\delta W + \delta \gamma Q} | F_k] + \delta \frac{2}{\alpha^2(1 - \frac{1}{\theta})\lambda_2(L^2)} E[\|x_{k+1} - x_k\|^2_{(I - \alpha^2 L^2)^T (I - \alpha^2 L^2)} | F_k] + \delta \frac{2}{\alpha^2(1 - \frac{1}{\theta})\lambda_2(L^2)}\|x_k - x^*\|^2_{(-\alpha^2 L^2 + \alpha L)^T (-\alpha^2 L^2 + \alpha L)} + \delta\left(c\left(1 - \frac{1}{q_{\max}}\right) + \frac{4(\theta - 1)L_f}{(1 - \frac{1}{\theta})\lambda_2(L^2)}\right) p_k + \delta\left(\frac{c}{q_{\min}} + \frac{2(\theta - 1)(2L_f - \mu)}{(1 - \frac{1}{\theta})\lambda_2(L^2)}\right)\left(f(x_k) - f(x^*) - (\nabla f(x^*))^T (x_k - x^*)\right)   (69)
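The bound \|a + b\|^2 \ge (1 - \theta)\|a\|^2 + (1 - 1/\theta)\|b\|^2 used in Eq. (65) is another Young-type inequality: expand \|a + b\|^2 = \|a\|^2 + \|b\|^2 + 2a^T b and apply 2a^T b \ge -\theta\|a\|^2 - \|b\|^2/\theta. A numeric spot-check:

```python
import numpy as np

rng = np.random.default_rng(4)
for _ in range(100):
    a, b = rng.normal(size=6), rng.normal(size=6)
    for theta in (1.5, 2.0, 5.0):
        lhs = np.sum((a + b) ** 2)
        rhs = (1 - theta) * np.sum(a ** 2) + (1 - 1 / theta) * np.sum(b ** 2)
        assert lhs >= rhs - 1e-12  # small slack for rounding
```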
Let D be the right side of Eq. (69). To make inequality (61) valid, a sufficient condition is D \le A. By Eq. (29), the term E[\|\lambda_{k+1} - \lambda_k\|^2 | F_k] in A can be rewritten as

E[\|\lambda_{k+1} - \lambda_k\|^2 | F_k] = E[\alpha^2 \|x_{k+1} - x^*\|^2_{L^2} | F_k]   (70)

Since f has a Lipschitz-continuous gradient, the last term in D satisfies

f(x_k) - f(x^*) - (\nabla f(x^*))^T (x_k - x^*) \le \frac{L_f}{2}\|x_k - x^*\|^2   (71)

Thus, by Eqs. (69)-(71) we can obtain a sufficient condition to make the inequality D \le A valid as follows:

\delta \frac{2}{\alpha^2(1 - \frac{1}{\theta})\lambda_2(L^2)} E[\|x_{k+1} - x_k\|^2_{(I - \alpha^2 L^2)^T (I - \alpha^2 L^2)} | F_k] + \delta \frac{2}{\alpha^2(1 - \frac{1}{\theta})\lambda_2(L^2)}\|x_k - x^*\|^2_{(-\alpha^2 L^2 + \alpha L)^T (-\alpha^2 L^2 + \alpha L)} + \delta\left(c\left(1 - \frac{1}{q_{\max}}\right) + \frac{4(\theta - 1)L_f}{(1 - \frac{1}{\theta})\lambda_2(L^2)}\right) p_k + E[\|x_{k+1} - x^*\|^2_{\delta W + \delta \gamma Q} | F_k] + \left(\delta \frac{c}{q_{\min}} + \delta \frac{2(\theta - 1)(2L_f - \mu)}{(1 - \frac{1}{\theta})\lambda_2(L^2)} + \frac{4\alpha}{\eta}(L_f - \mu) + \frac{c}{q_{\min}}\right)\frac{L_f}{2}\|x_k - x^*\|^2 \le E[\|x_{k+1} - x^*\|^2_{(1 - \gamma)Q + \alpha^2 L^2} | F_k] + E[\|x_{k+1} - x_k\|^2_{W - P} | F_k] + \|x_k - x^*\|^2_{\gamma Q} + \left(\frac{c}{q_{\max}} - \frac{4\alpha}{\eta} L_f\right) p_k   (72)
Inequality (72) holds if the following conditions are satisfied:

(1 - \gamma) Q + \alpha^2 L^2 \ge \delta W + \delta \gamma Q   (73)

W - P \ge \delta \frac{2 (I - \alpha^2 L^2)^T (I - \alpha^2 L^2)}{\alpha^2(1 - \frac{1}{\theta})\lambda_2(L^2)}   (74)

\gamma Q \ge \left((\delta + 1)\frac{c}{q_{\min}} + \delta \frac{2(\theta - 1)(2L_f - \mu)}{(1 - \frac{1}{\theta})\lambda_2(L^2)} + \frac{4\alpha}{\eta}(L_f - \mu)\right)\frac{L_f}{2} I + \delta \frac{2 (-\alpha^2 L^2 + \alpha L)^T (-\alpha^2 L^2 + \alpha L)}{\alpha^2(1 - \frac{1}{\theta})\lambda_2(L^2)}   (75)

\frac{c}{q_{\max}} - \frac{4\alpha}{\eta} L_f \ge \delta\left(c\left(1 - \frac{1}{q_{\max}}\right) + \frac{4(\theta - 1)L_f}{(1 - \frac{1}{\theta})\lambda_2(L^2)}\right)   (76)
where W > 0 and Q > 0. The inequalities W > 0 and Q > 0 hold when the parameters \alpha and \phi satisfy

0 < \alpha < \frac{1}{\lambda_{\max}(L)}, \quad 0 < \phi < 2\mu   (77)

Inequality (73) can be transformed into (1 - \gamma - \delta\gamma)\alpha(2\mu - \phi) I \ge \delta(I - \alpha L), where 0 < \gamma < 1, which is further equivalent to

\frac{\alpha(1 - \gamma - \delta\gamma)(2\mu - \phi)}{1 - \alpha\lambda_i(L)} \ge \delta > 0   (78)

for i = 1, \ldots, m. Since \arg\min_{\lambda_i(L), i = 1, \ldots, m} \frac{\alpha(1 - \gamma - \delta\gamma)(2\mu - \phi)}{1 - \alpha\lambda_i(L)} = \lambda_1(L) = 0, Eq. (78) can be reduced to

A_1 = \alpha(1 - \gamma - \delta\gamma)(2\mu - \phi) \ge \delta > 0   (79)
To make Eq. (74) hold, it is sufficient to show

W - P = I - \alpha L - \alpha\left(\eta + \frac{L_f^2}{\phi}\right) I > 0   (80)

and

\frac{\alpha^2(1 - \frac{1}{\theta})\lambda_2(L^2)\left(1 - \alpha\lambda_i(L) - \alpha(\eta + \frac{L_f^2}{\phi})\right)}{2(1 - \alpha^2\lambda_i(L^2))^2} \ge \delta   (81)

for i = 1, \ldots, m.
Define A_2 = \min_{i \in V}\left\{\frac{\alpha^2(1 - \frac{1}{\theta})\lambda_2(L^2)\left(1 - \alpha\lambda_i(L) - \alpha(\eta + \frac{L_f^2}{\phi})\right)}{2(1 - \alpha^2\lambda_i(L^2))^2}\right\}. From inequalities (80) and (81), we have

0 < \alpha < \frac{1}{\lambda_{\max}(L) + \eta + L_f^2/\phi}, \quad 0 < \delta \le A_2   (82)
Noting that \alpha L - \alpha^2 L^2 \ge 0, we have

\gamma \lambda_{\min}(Q) \ge \gamma \alpha (2\mu - \phi)   (83)

To make Eq. (75) hold, it is sufficient to show

0 < \gamma \alpha (2\mu - \phi) - \left(\frac{4\alpha}{\eta}(L_f - \mu) + \frac{c}{q_{\min}}\right)\frac{L_f}{2}   (84)

and

\delta \frac{2(-\alpha^2(\lambda_i(L))^2 + \alpha\lambda_i(L))^2}{\alpha^2(1 - \frac{1}{\theta})\lambda_2(L^2)} + \delta \frac{(\theta - 1)(2L_f - \mu)L_f}{(1 - \frac{1}{\theta})\lambda_2(L^2)} + \delta \frac{c L_f}{2 q_{\min}} \le \gamma \alpha (2\mu - \phi) - \frac{4\alpha}{\eta}(L_f - \mu)\frac{L_f}{2} - \frac{c L_f}{2 q_{\min}}   (85)
for i = 1, \ldots, m. Let T_i^1 be the sum of the coefficients of \delta on the left side of Eq. (85), i.e.,

T_i^1 = \frac{2(-\alpha^2(\lambda_i(L))^2 + \alpha\lambda_i(L))^2}{\alpha^2(1 - \frac{1}{\theta})\lambda_2(L^2)} + \frac{(\theta - 1)(2L_f - \mu)L_f}{(1 - \frac{1}{\theta})\lambda_2(L^2)} + \frac{c L_f}{2 q_{\min}}   (86)

and T_i^2 be the right side of Eq. (85), i.e.,

T_i^2 = \gamma \alpha (2\mu - \phi) - \frac{4\alpha}{\eta}(L_f - \mu)\frac{L_f}{2} - \frac{c L_f}{2 q_{\min}}

From Eq. (85), we can obtain an upper bound for \delta:

0 < \delta \le \min_{i \in V} \frac{T_i^2}{T_i^1} = A_3   (87)

Inequality (84) holds if

\left(\frac{4\alpha}{\eta}(L_f - \mu) + \frac{c}{q_{\min}}\right)\frac{L_f}{2} < \gamma \alpha (2\mu - \phi)

which follows

\frac{4\alpha}{\eta}(L_f - \mu) < \frac{2\gamma\alpha}{L_f}(2\mu - \phi) - \frac{c}{q_{\min}}   (88)
By Eq. (88), we have

c < \frac{2 q_{\min} \gamma \alpha (2\mu - \phi)}{L_f}   (89)
Inequality (76) holds when the following conditions are satisfied:

0 < \delta \le \frac{\frac{c}{q_{\max}} - \frac{4\alpha}{\eta} L_f}{c\left(1 - \frac{1}{q_{\max}}\right) + \frac{4(\theta - 1)L_f}{(1 - \frac{1}{\theta})\lambda_2(L^2)}} = A_4   (90)
0 < \frac{c}{q_{\max}} - \frac{4\alpha}{\eta} L_f   (91)
From Eq. (91), we have

c > \frac{4\alpha L_f q_{\max}}{\eta}   (92)
Combining Eqs. (89) and (92) yields a condition for the parameter \eta:

\eta > \frac{2 L_f^2 q_{\max}}{q_{\min} \gamma (2\mu - \phi)}   (93)
Therefore, inequalities (73)-(76) hold if \delta is chosen as

\delta \le \min\{A_1, A_2, A_3, A_4\}   (94)

where the parameters \eta, \phi, \alpha, c and \gamma satisfy

\eta > \frac{2 L_f^2 q_{\max}}{q_{\min} \gamma (2\mu - \phi)}, \quad 0 < \phi < 2\mu, \quad 0 < \alpha < \frac{1}{\lambda_{\max}(L) + \eta + L_f^2/\phi}, \quad \frac{4\alpha L_f q_{\max}}{\eta} < c < \frac{2 q_{\min} \gamma \alpha (2\mu - \phi)}{L_f}, \quad 0 < \gamma < 1.
This completes the proof.

References

[1] R.L. Raffard, C.J. Tomlin, S.P. Boyd, Distributed optimization for cooperative agents: application to formation flight, in: Proceedings of the 43rd IEEE Conference on Decision and Control, 2004, pp. 2453-2459.
[2] Y. Wang, Z. Ma, G. Chen, Distributed control of cluster lag consensus for first-order multi-agent systems on QUAD vector fields, J. Frankl. Inst. 355 (15) (2018) 7335-7353.
[3] L. Dai, Q. Cao, Y. Xia, Y. Gao, Distributed MPC for formation of multi-agent systems with collision avoidance and obstacle avoidance, J. Frankl. Inst. 354 (4) (2017) 2068-2085.
[4] M. Rabbat, R. Nowak, Distributed optimization in sensor networks, in: Proceedings of the 3rd International Symposium on Information Processing in Sensor Networks, 2004, pp. 20-27.
[5] L. Xiao, S. Boyd, S. Lall, A scheme for robust distributed sensor fusion based on average consensus, in: Proceedings of the 4th International Symposium on Information Processing in Sensor Networks, 2005, pp. 63-70.
[6] H. Li, S. Liu, Y.C. Soh, L. Xie, Event-triggered communication and data rate constraint for distributed optimization of multi-agent systems, IEEE Trans. Syst. Man Cybern. Syst. 48 (11) (2018) 1908-1919.
[7] B. Wang, Z. Li, X. Yan, Multi-subspace factor analysis integrated with support vector data description for multimode process, J. Frankl. Inst. 355 (15) (2018) 7664-7690.
[8] F. Brockherde, L. Vogt, L. Li, Bypassing the Kohn-Sham equations with machine learning, Nat. Commun. 8 (2017) 872.
[9] R. Cavalcante, I. Yamada, B. Mulgrew, An adaptive projected subgradient approach to learning in diffusion networks, IEEE Trans. Signal Process. 57 (7) (2009) 2762-2774.
[10] M. Chiang, P. Hande, T. Lan, Power control in wireless cellular networks, Found. Trends Netw. 2 (4) (2008) 381-533.
[11] U.A. Khan, S. Kar, J.M. Moura, DILAND: an algorithm for distributed sensor localization with noisy distance measurements, IEEE Trans. Signal Process. 58 (3) (2010) 1940-1947.
[12] Y. He, I. Lee, L. Guan, Distributed algorithms for network lifetime maximization in wireless visual sensor networks, IEEE Trans. Circuits Syst. Video Technol. 19 (5) (2009) 704-718.
[13] J. Zhao, Y. Wen, R. Shang, Optimizing sensor node distribution with genetic algorithm in wireless sensor network, in: Proceedings of the International Symposium on Neural Networks, 3176, 2004, pp. 242-247.
[14] H. Li, Q. Lu, T. Huang, Convergence analysis of a distributed optimization algorithm with a general unbalanced directed communication network, IEEE Trans. Netw. Sci. Eng., doi:10.1109/TNSE.2018.2848288.
[15] A. Nedić, A. Ozdaglar, Distributed subgradient methods for multi-agent optimization, IEEE Trans. Autom. Control 54 (1) (2009) 48-61.
[16] K. Yuan, Q. Ling, W. Yin, On the convergence of decentralized gradient descent, SIAM J. Optim. 26 (3) (2016) 1835-1854.
[17] D. Jakovetic, J. Xavier, J.M.F. Moura, Fast distributed gradient methods, IEEE Trans. Autom. Control 59 (5) (2014) 1131-1146.
[18] H. Li, Q. Lu, T. Huang, Distributed projection subgradient algorithm over time-varying general unbalanced directed graphs, IEEE Trans. Autom. Control 64 (3) (2019) 1309-1316.
[19] C. Xi, Q. Wu, U.A. Khan, On the distributed optimization over directed networks, Neurocomputing 267 (2017) 508-515.
[20] H. Li, Q. Lu, X. Liao, T. Huang, Accelerated convergence algorithm for distributed constrained optimization under time-varying general directed graphs, IEEE Trans. Syst. Man Cybern. Syst. 48 (11) (2018) 1908-1919.
[21] S. Boyd, N. Parikh, E. Chu, B. Peleato, Distributed optimization and statistical learning via the alternating direction method of multipliers, Found. Trends Mach. Learn. 3 (1) (2011) 1-122.
[22] S. Magnusson, P.C. Weeraddana, C. Fischione, A distributed approach for the optimal power-flow problem based on ADMM and sequential convex approximations, IEEE Trans. Control Netw. Syst. 2 (3) (2015) 238-253.
[23] J.F. Mota, J.M. Xavier, P.M. Aguiar, D-ADMM: a communication-efficient distributed algorithm for separable optimization, IEEE Trans. Signal Process. 61 (10) (2013) 2718-2723.
[24] Q. Ling, A. Ribeiro, Decentralized linearized alternating direction method of multipliers, in: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2014, pp. 5447-5451.
[25] W. Shi, Q. Ling, G. Wu, W. Yin, EXTRA: an exact first-order algorithm for decentralized consensus optimization, SIAM J. Optim. 25 (2) (2015) 944-966.
[26] A. Nedić, A. Ozdaglar, Subgradient methods for saddle-point problems, J. Optim. Theory Appl. 142 (1) (2009) 205-228.
[27] M. Zhu, S. Martínez, On distributed convex optimization under inequality and equality constraints, IEEE Trans. Autom. Control 57 (1) (2012) 151-164.
[28] T. Chang, A. Nedić, A. Scaglione, Distributed constrained optimization by consensus-based primal-dual perturbation method, IEEE Trans. Autom. Control 59 (6) (2014) 1524-1538.
[29] J. Lei, H. Chen, H. Fang, Primal-dual algorithm for distributed constrained optimization, Syst. Control Lett. 96 (2016) 110-117.
[30] S. Ram, A. Nedić, V. Veeravalli, Distributed stochastic subgradient projection algorithms for convex optimization, J. Optim. Theory Appl. 147 (3) (2010) 515-545.
[31] B. Johansson, M. Rabi, M. Johansson, A simple peer-to-peer algorithm for distributed optimization in sensor networks, in: Proceedings of the 46th IEEE Conference on Decision and Control, 2007, pp. 4705-4710.
[32] D. Blatt, A.O. Hero, H. Gauchman, A convergent incremental gradient method with a constant step size, SIAM J. Optim. 18 (1) (2007) 25-51.
[33] S. Lee, A. Nedić, Distributed random projection algorithm for convex optimization, IEEE J. Sel. Top. Signal Process. 7 (2) (2013) 221-229.
[34] R. Johnson, T. Zhang, Accelerating stochastic gradient descent using predictive variance reduction, Adv. Neural Inf. Process. Syst. (2013) 315-323.
[35] A. Defazio, F. Bach, S. Lacoste-Julien, SAGA: a fast incremental gradient method with support for non-strongly convex composite objectives, Adv. Neural Inf. Process. Syst. (2014) 1646-1654.
[36] A. Mokhtari, A. Ribeiro, DSA: decentralized double stochastic averaging gradient algorithm, J. Mach. Learn. Res. 17 (2016) 1-35.
[37] S.S. Ram, A. Nedić, V. Veeravalli, Stochastic incremental gradient descent for estimation in sensor networks, in: Proceedings of the Forty-First Asilomar Conference on Signals, Systems and Computers, 2007, pp. 582-586.
[38] H.T. Wai, N.M. Freris, A. Nedić, A. Scaglione, SUCAG: stochastic unbiased curvature-aided gradient method for distributed optimization, 2018, arXiv:1803.08198.
[39] N. Le Roux, M. Schmidt, F.R. Bach, A stochastic gradient method with an exponential convergence rate for finite training sets, Adv. Neural Inf. Process. Syst. (2012) 2663-2671.
[40] S. De, T. Goldstein, Efficient distributed SGD with variance reduction, in: Proceedings of the IEEE 16th International Conference on Data Mining, 2016, pp. 111-120.
[41] S. Boyd, L. Vandenberghe, Convex Optimization, Cambridge University Press, Cambridge, UK, 2004.
[42] L. Xiao, S. Boyd, S. Lall, A space-time diffusion scheme for peer-to-peer least-squares estimation, in: Proceedings of the 5th International Conference on Information Processing in Sensor Networks, 2006, pp. 168-176.
[43] Q. Liu, J. Wang, A second-order multi-agent network for bound-constrained distributed optimization, IEEE Trans. Autom. Control 60 (12) (2015) 3310-3315.