Signal Processing 98 (2014) 212–223
Unbiased consensus in wireless networks via collisional random broadcast and its application on distributed optimization

Hui Feng, Xuesong Shi, Tao Yang, Bo Hu

Department of Electronic Engineering, Fudan University, Shanghai 200433, China
Article history: Received 9 May 2013; received in revised form 26 September 2013; accepted 12 November 2013; available online 1 December 2013.

Abstract
We first propose an unbiased consensus algorithm for wireless networks via random broadcast, by which all node values converge almost surely and their limit equals the initial average in expectation. The novelty of the algorithm is that it works in any connected topology, despite the possible collisions caused by simultaneous data arrivals at receivers in a shared channel. Building on the consensus algorithm, we then propose a distributed optimization algorithm for a sum of convex objective functions, which is the fundamental model for many in-network signal processing applications. Simulation results show that our algorithms provide appealing performance with lower communication complexity than existing algorithms.
Keywords: Consensus; Random broadcast; Gossip; Distributed optimization
1. Introduction

In a multi-hop wireless network, each node communicates with others over a broadcast wireless medium, where data are usually exchanged among neighboring nodes due to power constraints. Early studies on wireless networks focused on multi-hop communication strategies, such as MAC and routing protocols [1]. With the development of sensor networks and the Internet of Things (IoT), a smart network performs the necessary signal processing in-network [2], treating data transmission and processing in an integrated way. Potential applications include distributed estimation [3-6], data fusion [7], classification [8], and recovery of sparse signals through $\ell_1$-norm minimization [9].

Consensus is a representative topic in network processing. The goal of consensus is to design a local (not global) information exchange strategy by which all nodes will
tend to the same state asymptotically. The fundamental theory of linear consensus is rooted in classical linear dynamical systems and Markov chains. Wang et al. gave a necessary and a sufficient condition for the consensus of a general output-feedback linear multi-agent system [10]. The consensus of a network with time delays, packet loss and quantization in communication was discussed in [11]. A model predictive controller was introduced to accelerate the consensus rate in [12]. Su et al. investigated the consensus of a distributed T-S fuzzy filter with time-varying delays and link failures [13]. Ren et al. [14] and Olfati-Saber et al. [15] surveyed the mainstream consensus algorithms and analyzed their convergence under various information exchange strategies. By their styles of information exchange, existing consensus algorithms can be divided into three categories:

Style I: pair-wise exchange;
Style II: local fusion;
Style III: asynchronous broadcast gossip.

In Style I, two neighboring nodes exchange and mix their data at each step. The pioneering works on Style I arose with P2P file sharing over the Internet [16].
As for Style II, each node acquires information from all its neighbors and then linearly mixes it with its own data [17]. Style III is asynchronous and broadcast-based: each node broadcasts data to its neighbors, and a node that successfully receives data from a neighbor mixes it with its own. Obviously, Style III is well suited to a wireless network, due to the inherent broadcast property of wireless communication. However, research on Style III is not as plentiful as that on Styles I and II; [18,19] are two representative works.

A consensus result can be reached in a deterministic or a random way. Most algorithms collected in [15] rely on deterministic information exchanges, where all nodes reach consensus asymptotically in the sense of the following definition.

Definition 1. Consider a network of $N$ nodes, where node $i$ has an initial value $x_i^0 \in \mathbb{R}^M$, $i = 1, 2, \ldots, N$. The network reaches an asymptotic consensus if there exists $x^\infty \in \mathbb{R}^M$ such that $\lim_{k\to\infty} x_i^k = x^\infty$, $i = 1, 2, \ldots, N$. Further, if $x^\infty = \bar{x}^0 \triangleq N^{-1}\sum_i x_i^0$, we have the average consensus (AC) result.

In contrast to deterministic ones, the works [18-21] investigated random algorithms involving random information exchanges over the network. For instance, in [19] each node performs random broadcast gossip as in Style III. Compared with deterministic algorithms, random algorithms may speed up the convergence rate and substantially reduce the communication cost [22]. A random consensus algorithm usually drives all node values in a network to the same value almost surely (a.s.), in the sense of the following definition.

Definition 2. Consider a network of $N$ nodes, where node $i$ has an initial value $x_i^0 \in \mathbb{R}^M$, $i = 1, 2, \ldots, N$. The network reaches probabilistic consensus (PC) if there exists $x^\infty \in \mathbb{R}^M$ such that $\lim_{k\to\infty} x_i^k = x^\infty$, $i = 1, 2, \ldots, N$, a.s. Further, if $E[x^\infty] = \bar{x}^0 = N^{-1}\sum_i x_i^0$, we have the unbiased consensus (UC) result.

In a wireless network using shared channels, collisions are possible at receivers. Collisions temporarily change the topology of the network, which complicates the consensus procedure. Two methods have been considered to handle collisions. The first is to avoid any possible collision by designing a special consensus strategy. Aysal et al. [19] proposed a random gossip algorithm where only one node wakes up and broadcasts at a time. The authors proved that the algorithm reaches UC, yet it is inefficient, since only one node can broadcast in each slot. In a practical scenario, we expect more nodes to broadcast simultaneously, so that data disseminate over the network faster. Fagnani et al. [21] designed a random broadcast strategy, the collision broadcast gossip algorithm (CBGA), where each node broadcasts with the same probability. However, the authors only proved the UC of CBGA for specific Abelian Cayley topologies, and their algorithm may fail to reach UC in other connected networks.

As a matter of fact, there is an inherent relationship between the consensus problem and the distributed optimization problem. As indicated in [23], the consensus problem is a special case of the variance minimization problem over the network, where the objective function is
$\min \sum_i \|x_i^k - \bar{x}^k\|^2$. Collaborative optimization over a network is an important method for solving practical problems in wired and wireless networks, such as signal processing [24], distributed learning [25], and automatic control [26]. In such cases, the objective function usually takes the form of a sum of components, i.e., $f(x) = \sum_{i=1}^N f_i(x)$, where each component belongs to a specific processing node in the network. Each node aims to obtain a global optimization result through local information exchanges with its neighbors [27], which combines network communication and distributed computing [28].

There are mainly three methods in the literature to distribute an optimization problem. The first is to add explicit constraints by introducing auxiliary variables and then solve the problem by the method of multipliers (MoM) [29, Section 3.4]. A widely applied variant of MoM is the alternating direction MoM (ADMoM) [29,30], which has been used to solve various distributed problems [3-6,8,9]. The second method is the incremental approach [22,31,32], where each node makes a local gradient descent, relays the result to another node, and the procedure repeats. Although it does not require global coordination over the network, the incremental approach cannot perform concurrent computations on multiple nodes, which leads to a slow convergence rate. The third method is to integrate a consensus algorithm with local computations. Nedic et al. combined Style II and Style III consensus with local gradient descent in [33] and [34], respectively. Ref. [34] is the closest work to the distributed optimization part of this paper. However, the convergence rate of [34] is rather slow, since only one node may wake up in each slot.

The contributions of this paper are mainly twofold. First, we propose a random broadcast gossip strategy, which is a Style III consensus approach. As in [21], we take into account the possible collisions at receivers due to the asynchronous behavior. We prove that our algorithm reaches UC in any connected network, which is a significant improvement over [19,21]. Second, we integrate gradient descent with the random broadcast gossip, obtaining a fast distributed optimization method for a sum of convex functions.

The rest of this paper is organized as follows. The problem statement and the algorithms are presented in Section 2. The convergence analysis of the proposed algorithms is given in Section 3. Simulations with interpretations are provided in Section 4. Finally, we conclude the paper with some remarks in Section 5.
2. Problem statement and algorithms

The application scenario considered in this paper satisfies the following assumptions:

(A1) All nodes work in half-duplex mode in a slotted-time shared channel, i.e., each node alternates between the roles of transmitter and receiver across slots, but cannot transmit and receive in the same slot.
(A2) All nodes are equipped with omni-directional antennas with the same coverage radius, such that a pair of nodes can communicate with each other if and only if the distance between them is within the radius.
(A3) The underlying graph $G$ of the network is connected.

There are two layers of topology in this paper. The first layer is the underlying graph $G$ consisting of $N$ nodes. By (A2), any two nodes within one-hop distance are linked by an edge in $G$. The $N \times N$ adjacency matrix $A$ of the undirected graph $G$ is defined entrywise by
$$a_{ij} = \begin{cases} 1 & \text{if there is an edge linking node } j \text{ with node } i, \\ 0 & \text{otherwise.} \end{cases}$$
The $N \times N$ degree matrix $D$ of $G$ is a diagonal matrix whose $i$-th diagonal entry $d_i$ is the number of neighbors of node $i$ in $G$.

The second layer of topology is a time-varying graph $G^k$ corresponding to slot $k$. In this paper, each node broadcasts its data to its neighbors following independent and identical Bernoulli processes with $\mathrm{prob}(\text{transmit}) = p$ and $\mathrm{prob}(\text{receive}) = 1-p$. Receiving failures may occur in two circumstances. First, there is a collision when two or more packets arrive at the same node in the same slot. Second, a packet arrives at a node that is itself in transmitting mode. As shown in Fig. 1, among the three neighbors of node 2, only node 3 hears the broadcast successfully; the others fail due to either a collision or being in transmitting mode themselves.

To describe the time-switching links of $G^k$, we define the time-varying adjacency matrix $A^k$ of $G^k$ by rows:
$$\mathrm{row}_i(A^k) = \begin{cases} e_j^T & \text{if } i \leftarrow j \text{ successfully}, \\ \mathbf{0}^T & \text{otherwise}, \end{cases}$$
where $e_j^T$ is a row vector whose entries are all zero except for a 1 in the $j$-th position. The condition "$i \leftarrow j$ successfully" means that node $i$ hears the broadcast from its neighbor $j$ successfully, which occurs only when node $j$ broadcasts while node $i$ is in receiving mode and no other neighbor of $i$ broadcasts. We also define the in-degree matrix $D^k$ of $G^k$ with diagonal elements
$$d_i^k = \begin{cases} 1 & \text{if } i \leftarrow j \text{ successfully}, \\ 0 & \text{otherwise.} \end{cases}$$
[Fig. 1. Nodes 1, 2 and 5 are broadcasting, and nodes 3 and 4 are in receiving mode. Legend: node in transmitting mode / node in receiving mode.]
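To make the reception rule concrete, here is a minimal simulation sketch of one slot of the collision broadcast model; it is ours, not from the paper, and the 4-node path graph, probability and seed are purely illustrative assumptions.

```python
import numpy as np

def one_slot(adjacency, p, rng):
    """Simulate one slot: each node transmits independently with
    probability p; node i hears node j iff i is receiving and j is
    the only transmitting neighbor of i (otherwise a collision or a
    half-duplex failure occurs). Returns the realized matrix A^k."""
    n = adjacency.shape[0]
    transmitting = rng.random(n) < p
    a_k = np.zeros((n, n))
    for i in range(n):
        if transmitting[i]:
            continue  # half-duplex: a transmitting node cannot receive
        senders = np.flatnonzero(adjacency[i] * transmitting)
        if len(senders) == 1:  # exactly one arriving packet: no collision
            a_k[i, senders[0]] = 1.0
    return a_k

rng = np.random.default_rng(0)
path = np.array([[0, 1, 0, 0], [1, 0, 1, 0],
                 [0, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
print(one_slot(path, p=0.3, rng=rng))  # rows are e_j^T or all zeros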
The nodes hearing the broadcast successfully update their values via a linear mixing operation with mixing coefficient $\gamma_i$; those failing to hear the broadcast keep their original data. This yields our first algorithm, listed as Algorithm 1.

Main result 1: We prove that the following algorithm, C-RBG, makes the whole network reach UC in the sense of Definition 2.

Algorithm 1. Consensus via random broadcast gossip (C-RBG).

For slot $k = 0, 1, 2, \ldots$
  Each node broadcasts with probability $p$.
  For node $i = 1, 2, \ldots, N$
    If $i \leftarrow j$ successfully,
      $$x_i^{k+1} = (1-\gamma_i)x_i^k + \gamma_i x_j^k \qquad (1)$$
    Else
      $$x_i^{k+1} = x_i^k$$
until some stopping criterion is met.
The mixing coefficient $\gamma_i$ is crucial to conservation of the average over the whole network. We design $\gamma_i$ as
$$\gamma_i = (1-p)^{q-d_i}, \qquad (2)$$
where $q$ is a constant larger than the maximum degree of the graph $G$, i.e., $q > \max_i d_i$, such that $0 < \gamma_i < 1$. The intuition behind this mixing rule is that a node with more neighbors should be more willing to accept its neighbors' data rather than keep its own. In a regular graph, where all nodes have the same number of neighbors, all nodes share the same mixing coefficient. As Section 3 shows, the design of $\gamma_i$ in (2) compensates for the different collision probabilities experienced by nodes of different degrees, which is crucial to the unbiasedness of C-RBG.
The second proposed algorithm solves an optimization problem whose objective is a sum of component functions:
$$\min\ f(x) = \sum_{i=1}^N f_i(x) \quad \text{s.t.} \quad x \in X = \bigcap_i X_i, \qquad (\mathrm{P1})$$
where $f_i: \mathbb{R}^M \to \mathbb{R}$ is the continuous convex objective function (not necessarily differentiable) of node $i$, and the feasible set $X_i$ of node $i$ is a nonempty, closed, convex subset of $\mathbb{R}^M$. As shown in Algorithm 2, each node randomly broadcasts its value; nodes that receive successfully update their data as in C-RBG, followed by a subgradient descent step.

Algorithm 2. Distributed gradient descent via random broadcast gossip (DGD-RBG).

For slot $k = 0, 1, 2, \ldots$
  Each node broadcasts with probability $p$.
  For node $i = 1, 2, \ldots, N$
    If $i \leftarrow j$ successfully,
      $$v_i^k = (1-\gamma_i)x_i^k + \gamma_i x_j^k \qquad (3)$$
      $$x_i^{k+1} = P_{X_i}(v_i^k - \alpha_i^k g_i^k) \qquad (4)$$
    Else
      $$x_i^{k+1} = x_i^k$$
until some stopping criterion is met.
In (4), $g_i^k$ is a subgradient of $f_i$ at $v_i^k$, $\alpha_i^k$ is the stepsize, and $P_{X_i}$ is the projection operator onto the set $X_i$ of node $i$.
Clearly, parallel operations are feasible in (3) and (4). For the subgradient descent in (4) to be well defined, we also assume that

(A4) The norms of all subgradients in DGD-RBG are bounded, i.e., there exists a constant $C$ such that $\|g_i^k\| \le C$ holds for all $i, k$.

Let $U_i^k$ be the event that node $i$ is updated in slot $k$, and let $\mathbb{1}(\cdot)$ be the 0-1 indicator function. Each node counts its accumulated number of updates until slot $k$ as $u_i^k = \sum_k \mathbb{1}(U_i^k)$, and the stepsize for each subgradient descent is chosen as [34]
$$\alpha_i^k = \frac{1}{u_i^k + 1}. \qquad (5)$$
The stepsize (5) is diminishing over time.

Main result 2: We prove that DGD-RBG makes the whole network reach PC in the sense of Definition 2, with $\lim_{k\to\infty} x_i^k = x^*$ a.s., where $x^* \in X^*$ is an optimal solution of (P1).
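For concreteness, here is a sketch of one DGD-RBG slot under the same reception model. It is ours, not the paper's code: `subgrad` and `project` are hypothetical callbacks standing in for a subgradient of $f_i$ and the projection onto $X_i$, and `updates` carries the counters $u_i^k$ of the stepsize rule (5).

```python
import numpy as np

def dgd_rbg_slot(x, adjacency, p, gamma, subgrad, project, updates, rng):
    """One slot of Algorithm 2 (DGD-RBG): mix per Eq. (3), then take a
    projected subgradient step per Eq. (4) with the stepsize of Eq. (5).
    Requires the one_slot reception sketch defined in Section 2."""
    a_k = one_slot(adjacency, p, rng)      # realized receptions
    x_new = x.copy()
    for i, j in zip(*np.nonzero(a_k)):     # i <- j successful
        v = (1 - gamma[i]) * x[i] + gamma[i] * x[j]        # Eq. (3)
        alpha = 1.0 / (updates[i] + 1)                     # Eq. (5)
        x_new[i] = project(i, v - alpha * subgrad(i, v))   # Eq. (4)
        updates[i] += 1                    # accumulate the indicator of U_i^k
    return x_new
```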
3. Convergence results

3.1. C-RBG

The iterations (1) of all nodes can be written in matrix form as
$$X^{k+1} = W^k X^k, \qquad (6)$$
where the $i$-th row of $X^k$ is the data of node $i$ in slot $k$, $(x_i^k)^T$, and $W^k$ is the updating matrix in slot $k$. Considering (6) by columns, the linear dynamic system (6) can be described as
$$z_m^{k+1} = W^k z_m^k, \qquad (7)$$
where $z_m^k = (x_{1,m}^k, x_{2,m}^k, \ldots, x_{N,m}^k)^T$ is the $m$-th column of $X^k$, i.e., the vector of all nodes' $m$-th components. Since the conclusions drawn below are the same for every component, we omit the column index $m$ for brevity where no confusion arises.

From the description of C-RBG, the rows of $W^k$ are
$$\mathrm{row}_i(W^k) = \begin{cases} (1-\gamma_i)e_i^T + \gamma_i e_j^T & \text{if } i \leftarrow j \text{ successfully}, \\ e_i^T & \text{otherwise}. \end{cases} \qquad (8)$$
It follows that $W^k$ can be written as $W^k = I - \Gamma L^k = I - \Gamma(D^k - A^k)$, where $\Gamma$ is the diagonal matrix with elements $\gamma_i$, and $L^k \triangleq D^k - A^k$ is the Laplacian matrix of $G^k$. Clearly, $W^k$ is a nonnegative stochastic matrix, i.e., $W^k \mathbf{1} = \mathbf{1}$, but it is not doubly stochastic, since $\mathbf{1}^T W^k \ne \mathbf{1}^T$.

From the random broadcast behavior with possible collisions discussed above, the probability of each instance of $A^k$ by rows is
$$\mathrm{prob}(R_{i \leftarrow j}^k) = \mathrm{prob}(\mathrm{row}_i(A^k) = e_j^T) = p(1-p)^{d_i}, \qquad (9)$$
where $R_{i \leftarrow j}^k$ is the event that node $i$ successfully receives the broadcast from node $j$ in slot $k$; it occurs when node $j$ alone among the neighbors of node $i$ broadcasts and node $i$ is in receiving mode. Correspondingly, the probability law of the diagonal elements of $D^k$ is
$$\mathrm{prob}(U_i^k) = \mathrm{prob}(d_i^k = 1) = d_i\, p(1-p)^{d_i}, \qquad (10)$$
where $U_i^k$ is the event that node $i$ updates successfully in slot $k$, which depends on the degree of each node in the undirected graph $G$.

Now we define the expectation matrices $\bar{D} \triangleq E[D^k]$, $\bar{A} \triangleq E[A^k]$, $\bar{L} \triangleq E[L^k] = \bar{D} - \bar{A}$, and $\bar{W} \triangleq E[W^k]$. With these definitions, the expectation of (7) can be written as $E[z^{k+1} \mid z^k] = \bar{W} z^k$, where $\bar{W}$ has the following properties.

Lemma 1. Let assumptions (A1)-(A3) hold. Then:
(1) $\bar{W} = I - \varepsilon L$, where $\varepsilon = p(1-p)^q$ and $L = D - A$ is the Laplacian matrix of the undirected graph $G$;
(2) $\bar{W}$ is non-negative and doubly stochastic, i.e., $\bar{W}\mathbf{1} = \mathbf{1}$ and $\mathbf{1}^T \bar{W} = \mathbf{1}^T$;
(3) the eigenvalues of $\bar{W}$ can be sorted as
$$0 \le \lambda_N(\bar{W}) \le \cdots \le \lambda_2(\bar{W}) < \lambda_1(\bar{W}) = 1,$$
where $\lambda_{i+1}(\bar{W}) = 1 - \varepsilon \lambda_{N-i}(L)$, $i = 0, 1, \ldots, N-1$, and 1 is the unique largest eigenvalue of $\bar{W}$.

Proof. From (8), we can calculate $\bar{W}$ by rows as
$$\mathrm{row}_i(\bar{W}) = e_i^T\bigl(1 - d_i p(1-p)^{d_i}\bigr) + \sum_{j \in \mathcal{N}_i} p(1-p)^{d_i}\bigl((1-\gamma_i)e_i^T + \gamma_i e_j^T\bigr) = e_i^T - \varepsilon \sum_{j \in \mathcal{N}_i} (e_i^T - e_j^T), \qquad (11)$$
where $\mathcal{N}_i$ is the index set of the neighbors of node $i$. It follows from (11) that $\bar{W} = I - \varepsilon L$, which relates the time-varying updating matrix to the Laplacian matrix $L$ of the time-invariant $G$. $\bar{W}$ is nonnegative because every $W^k$ is nonnegative. The doubly stochastic property of $\bar{W}$ follows from $L\mathbf{1} = \mathbf{0}$ and $\mathbf{1}^T L = \mathbf{0}^T$ for the Laplacian matrix of an undirected graph. The third result can be derived by an analysis analogous to that of the Perron matrix in [15], to which we refer for details. □

Various existing random consensus algorithms can be rewritten in the form (7); some examples are given in [18]. A useful extension of (7) allows i.i.d. link failures with probability $\tau$, for which the updating matrix in mean becomes $\bar{W} = I - \tau\varepsilon L$.

The following proposition provides a weak convergence result for the dynamic system (7), which follows from Lemma 1. A stronger convergence result is given in Theorem 1.

Proposition 1 (Convergence in mean, Aysal et al. [19]). Let assumptions (A1)-(A3) hold. Then
$$E\Bigl[\lim_{k\to\infty} z^k\Bigr] = \mathbf{1}\bar{z}^0, \quad \text{where } \bar{z}^0 = \mathbf{1}^T z^0 / N.$$
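As a numerical sanity check of our own (not part of the paper), the next sketch estimates $\bar{W} = E[W^k]$ by simulation and compares it with the closed form $I - \varepsilon L$ of Lemma 1, also confirming double stochasticity; it reuses `one_slot` from Section 2 and an assumed toy graph.

```python
import numpy as np

def estimate_w_bar(adjacency, p, q, trials=50000, seed=1):
    """Monte Carlo estimate of W_bar = E[W^k] under the reception model."""
    rng = np.random.default_rng(seed)
    n = adjacency.shape[0]
    deg = adjacency.sum(axis=1)
    gamma = (1.0 - p) ** (q - deg)
    acc = np.zeros((n, n))
    for _ in range(trials):
        a_k = one_slot(adjacency, p, rng)
        l_k = np.diag(a_k.sum(axis=1)) - a_k       # L^k = D^k - A^k
        acc += np.eye(n) - np.diag(gamma) @ l_k    # W^k = I - Gamma L^k
    return acc / trials

A = np.array([[0, 1, 0, 0], [1, 0, 1, 0],
              [0, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
p = 0.2
deg = A.sum(axis=1)
q = deg.max() + 1.0
L = np.diag(deg) - A
w_bar = np.eye(4) - p * (1 - p) ** q * L           # Lemma 1, part (1)
print(np.abs(estimate_w_bar(A, p, q) - w_bar).max())  # small sampling error
print(w_bar.sum(axis=0), w_bar.sum(axis=1))        # all ones: doubly stochastic
```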
From the eigen-decomposition view of $\bar{W}$, we have $E[z^k] \to c\mathbf{1}$, and the decay rate of $E[z^k]$ depends on the second largest eigenvalue $\lambda_2(\bar{W}) = 1 - \varepsilon\lambda_{N-1}(L)$. Since the eigenvalues of $L$ depend only on the topology, we can choose a suitable $\varepsilon$ to make $\lambda_2(\bar{W})$ as small as possible. Recalling that $\varepsilon = p(1-p)^q$, we suggest choosing $q$ slightly larger than $\max_i d_i$, but not by too much.

Corollary 1. The random sequence $\{\bar{z}^k\}$ constitutes a martingale, i.e., $E[\bar{z}^{k+1} \mid \bar{z}^k] = \bar{z}^k$.

Proof. $E[\bar{z}^{k+1} \mid \bar{z}^k] = E[\mathbf{1}^T z^{k+1}/N \mid \bar{z}^k] = E[\mathbf{1}^T W^k z^k / N \mid \bar{z}^k] = \mathbf{1}^T \bar{W} z^k / N = \mathbf{1}^T z^k / N = \bar{z}^k$, where we use Lemma 1 in the fourth equality. □

Theorem 1. Let assumptions (A1)-(A3) hold. Then C-RBG achieves UC.

Proof. For each $W^k$, the diagonal entries satisfy $w_{ii}^k > 0$. Since $\bar{W} = I - \varepsilon L$, the underlying graph of $\bar{W}$ is strongly connected. From Corollary 3.2 in [18], the dynamic system (7) achieves PC. Further, $\mathbf{1}^T E[z^k] = \mathbf{1}^T z^0$ for any $k$, since $\mathbf{1}^T \bar{W} = \mathbf{1}^T$, which gives the UC result of C-RBG. □

Now we evaluate the possibility of $\bar{z}^k$ deviating from the initial average $\bar{z}^0$. Proposition 2 gives a probabilistic upper bound on the deviation. Before the formal statement, we need a lemma bounding all iterated values.

Lemma 2. Let assumptions (A1) and (A2) hold. For any $k$, we have $\|z^k\|_\infty \le \|z^0\|_\infty$.

Proof. From (1), every element of $z^k$ is a convex combination of elements of $z^{k-1}$, which implies $\max_i |z_i^k| \le \max_i |z_i^{k-1}|$, i.e., $\|z^k\|_\infty \le \|z^{k-1}\|_\infty$. Iterating backward to slot 0 proves the lemma. □

Proposition 2. Let assumptions (A1) and (A2) hold. Given any positive constant $\delta$, we have
$$\mathrm{prob}(|\bar{z}^k - \bar{z}^0| \ge \delta) \le \exp\!\left(-\frac{\delta^2}{8k(\gamma_{\max}\|z^0\|_\infty)^2}\right).$$

Proof. From the iteration formula (1), we have
$$\bar{z}^{k+1} - \bar{z}^k = \frac{1}{N}\mathbf{1}^T z^{k+1} - \frac{1}{N}\mathbf{1}^T z^k = \frac{1}{N}\sum_{(i,j) \in A^k} (-\gamma_i z_i^k + \gamma_i z_j^k),$$
where $(i,j) \in A^k$ ranges over the node indices of all successful transmission pairs in slot $k$. Since $|-\gamma_i z_i^k + \gamma_i z_j^k| \le 2\gamma_i \max(|z_i^k|, |z_j^k|)$, we have
$$|\bar{z}^{k+1} - \bar{z}^k| \le 2\gamma_{\max}\|z^0\|_\infty,$$
where $\gamma_{\max} = \max_i \{\gamma_i\}$ and Lemma 2 is used. Applying the Azuma-Hoeffding inequality [35] to the martingale of Corollary 1 completes the proof:
$$\mathrm{prob}(|\bar{z}^k - \bar{z}^0| \ge \delta) \le \exp\!\left(-\frac{\delta^2}{2k(2\gamma_{\max}\|z^0\|_\infty)^2}\right). \qquad \Box$$

To study the mean square convergence of the algorithm, we define the deviation vector as
$$\xi^k = z^k - \mathbf{1}\bar{z}^k. \qquad (12)$$
Two convergence factors are defined in the existing literature.

Definition 3 (Asymptotic Mean Square Convergence Factor). $R_a = \sup_{\xi^0 \ne 0} \lim_{k\to\infty} \bigl(E[\|\xi^k\|_2^2 \mid \xi^0] / \|\xi^0\|_2^2\bigr)^{1/k}$.

Definition 4 (Per-Step Mean Square Convergence Factor). $R_s = \sup_{\xi^k \ne 0} E[\|\xi^{k+1}\|_2^2] / \|\xi^k\|_2^2$.

Since Lemma 1 in [36] shows that $R_a \le R_s$, we consider only $R_s$ in what follows. From (7) and (12), we have $\xi^{k+1} = W^k z^k - JW^k z^k = B^k z^k$, where $J = (1/N)\mathbf{1}\mathbf{1}^T$ and $B^k = (I-J)W^k$. Note that for any stochastic matrix $W^k$, $B^k \mathbf{1} = (I-J)W^k\mathbf{1} = (I-J)\mathbf{1} = \mathbf{0}$. It follows that
$$\xi^{k+1} = B^k z^k - B^k \mathbf{1}\bar{z}^k = B^k \xi^k.$$
The following lemma states several important properties of $B^k$.

Lemma 3 (Aysal et al. [19] and Nedic [34]). Let assumptions (A1)-(A3) hold, and let $\lambda_{1|S}(E[(B^k)^T B^k])$ be the spectral radius of $E[(B^k)^T B^k]$ on the subspace $S = \{z \in \mathbb{R}^N \mid \mathbf{1}^T z = 0\}$. Then:
(1) $R_s = \lambda_{1|S}(E[(B^k)^T B^k])$;
(2) $\lambda_{1|S}(E[(B^k)^T B^k]) < 1$.

3.2. DGD-RBG

As for other gradient-like algorithms, the stepsize rule is crucial to the convergence of DGD-RBG. The following lemma bounds the stepsize defined in (5). It is worth noting that the bound differs from that in [34], since the transmission behavior is different.

Lemma 4. Let assumptions (A1) and (A2) hold. For any positive integer $\tilde{k}$, there exists a constant $K$ such that for $k \ge K$,
$$\frac{1}{(k+\tilde{k})\eta_{\max}} \le \alpha_i^k \le \frac{1}{(k-\tilde{k})\eta_{\min}} \quad \text{a.s.},$$
where $\eta_{\min} = \min_i d_i p(1-p)^{d_i}$ and $\eta_{\max} = \max_i d_i p(1-p)^{d_i}$.

Proof. Applying the strong law of large numbers to (5),
$$\lim_{k\to\infty} \sum_k \mathbb{1}(U_i^k)/k = \lim_{k\to\infty} u_i^k / k = d_i p(1-p)^{d_i} \quad \text{a.s.}$$
Thus, for any positive $\delta$, there exists a constant $K$ such that for $k \ge K$, $|u_i^k - k\, d_i p(1-p)^{d_i}| \le \delta$. Letting $\delta = \tilde{k} d_i p(1-p)^{d_i} - 1$, we complete the proof by simple substitutions. □
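A quick empirical check of our own, on a hypothetical 4-node graph, of the update probability (10) that drives the strong-law limit in Lemma 4 (reusing `one_slot` from Section 2):

```python
import numpy as np

rng = np.random.default_rng(2)
A = np.array([[0, 1, 1, 0], [1, 0, 1, 0],
              [1, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
deg = A.sum(axis=1)
p, slots = 0.15, 100000
updates = np.zeros(4)
for _ in range(slots):
    updates += one_slot(A, p, rng).sum(axis=1)  # adds 1 iff U_i^k occurred
print(updates / slots)                 # empirical frequency u_i^k / k
print(deg * p * (1 - p) ** deg)        # theoretical d_i p (1-p)^{d_i}, Eq. (10)
```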
The following lemma estimates the eigenvalues of $E[(W^k)^T W^k]$; it will be used in the proof of Lemma 6.

Lemma 5. Let assumptions (A1) and (A2) hold. Then $\lambda_1(E[(W^k)^T W^k]) \le 1$.

Proof. Since $0 \le \gamma_i \le 1$ in (2), $W^k$ is nonnegative, and so is $E[(W^k)^T W^k]$. Since $W^k$ is stochastic, we have $E[(W^k)^T W^k]\mathbf{1} = E[(W^k)^T \mathbf{1}] = \bar{W}^T \mathbf{1} = \mathbf{1}$. Applying the Gershgorin circle theorem to $E[(W^k)^T W^k]$, all its eigenvalues lie in the disks $\bigcup_i \{\lambda : |\lambda - \mathrm{diag}_i(E[(W^k)^T W^k])| \le \sum_{j\ne i} (E[(W^k)^T W^k])_{ij}\}$, which completes the proof. □

Lemma 6. Let assumptions (A1) and (A2) hold. The variance of all nodes does not expand after the update (3), i.e., $\sum_{i=1}^N E[\|v_i^k - y\|_2^2] \le \sum_{i=1}^N \|x_i^k - y\|_2^2$ for any $y \in \mathbb{R}^M$.

Proof. Expanding the variance as a sum over components, we have
$$\sum_{i=1}^N E[\|v_i^k - y\|_2^2] = \sum_{i=1}^N \sum_{m=1}^M E[(v_{i,m}^k - y_m)^2] = \sum_{m=1}^M E[\|W^k z_m^k - \mathbf{1}y_m\|_2^2],$$
where $y_m$ is the $m$-th component of $y$. For each component $m$, we have
$$E[\|W^k z_m^k - \mathbf{1}y_m\|_2^2] = E[\|W^k z_m^k\|_2^2] - 2E[(W^k z_m^k)^T \mathbf{1}y_m] + \|\mathbf{1}y_m\|_2^2 \le \lambda_1(E[(W^k)^T W^k])\|z_m^k\|_2^2 - 2(z_m^k)^T \bar{W}^T \mathbf{1}y_m + \|\mathbf{1}y_m\|_2^2.$$
By applying Lemmas 1 and 5, we have
$$\sum_{m=1}^M E[\|W^k z_m^k - \mathbf{1}y_m\|_2^2] \le \sum_{m=1}^M \bigl(\|z_m^k\|_2^2 - 2(z_m^k)^T \mathbf{1}y_m + \|\mathbf{1}y_m\|_2^2\bigr) = \sum_{m=1}^M \|z_m^k - \mathbf{1}y_m\|_2^2.$$
Reordering the sum by component index $m$ completes the proof. □

The following lemma is the fundamental tool for analyzing the convergence of the stochastic sequence generated by DGD-RBG.

Lemma 7 (Prop. 4.2 in Bertsekas and Tsitsiklis [37]: supermartingale convergence). Let $(\Omega, \mathcal{F}, P)$ be a probability space and let $\{\mathcal{F}_k\}$, $\mathcal{F}_k \subseteq \mathcal{F}_{k+1}$, be a filtration of sub-$\sigma$-fields of $\mathcal{F}$. Let $\{u_k\}$, $\{r_k\}$, and $\{w_k\}$ be three nonnegative sequences of random variables adapted to $\{\mathcal{F}_k\}$. Assume that $\sum_{k=0}^\infty w_k < \infty$ and $E[u_{k+1} \mid \mathcal{F}_k] \le u_k - r_k + w_k$ hold a.s. Then the sequence $\{u_k\}$ converges to a nonnegative random variable and $\sum_{k=0}^\infty r_k < \infty$ a.s.
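Before moving on, here is a small Monte Carlo check of our own that the expected deviation sum in Lemma 6 indeed does not expand after the mixing step (3); note the bound holds in expectation, not per realization. The graph and reference point are assumptions, and `one_slot` is the sketch from Section 2.

```python
import numpy as np

rng = np.random.default_rng(3)
A = np.array([[0, 1, 1, 0], [1, 0, 1, 0],
              [1, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
deg = A.sum(axis=1)
p = 0.2
gamma = (1.0 - p) ** (deg.max() + 1.0 - deg)   # Eq. (2) with q = max d_i + 1
x = rng.normal(size=4)                          # arbitrary node values
y = 0.7                                         # arbitrary reference point
before = np.sum((x - y) ** 2)
after, trials = 0.0, 20000
for _ in range(trials):
    v = x.copy()
    for i, j in zip(*np.nonzero(one_slot(A, p, rng))):
        v[i] = (1 - gamma[i]) * x[i] + gamma[i] * x[j]  # mixing, Eq. (3)
    after += np.sum((v - y) ** 2)
print(after / trials, "<=", before)    # Lemma 6, in expectation
```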
[Fig. 2. Three kinds of network: (a) random geometric network, (b) tree network, (c) cycle network.]
Proposition 3 claims that the stochastic sequences $\{x_i^k\}$ and $\{v_i^k\}$ of all nodes reach consensus on the running average $\bar{x}^k$. Theorem 2 claims that the consensus value lies in the optimal solution set of problem (P1). The proofs (see the Appendices) are analogous to [34], although the details differ because the broadcast behaviors of the two papers differ.

Proposition 3. Let assumptions (A1)-(A4) hold. For DGD-RBG, we have
(1) $\lim_{k\to\infty} \|x_i^k - \bar{x}^k\|_2^2 = 0$ a.s.;
(2) $\lim_{k\to\infty} \|v_i^k - \bar{x}^k\|_2^2 = 0$ a.s.

Theorem 2. Let assumptions (A1)-(A4) hold, and assume the optimal set $X^*$ of (P1) is non-empty. Then the stochastic sequences $\{x_i^k\}$ generated by DGD-RBG converge to some $x^* \in X^*$ a.s.

4. Simulations

We validate our algorithms on various connected undirected networks, three of which (a random geometric network, a tree network, and a cycle network) are shown in Fig. 2. Each topology contains 50 nodes. In the random geometric network, the nodes are uniformly distributed in a unit square, and an edge links any pair of nodes within the transmission radius, which is set to 0.08. In the tree network, the nodes form a hierarchical topology with 5 levels. In the cycle network, the nodes are connected one by one as a ring. The simulation results comprise two parts. In the consensus part, we investigate the average bias in addition to the consensus results. In the distributed optimization part, we validate that our algorithm reaches consensus on an optimal solution.

4.1. C-RBG

For ease of visualization, the dimension of the node values is set to 1, i.e., $x_i^k \in \mathbb{R}$, $i = 1, 2, \ldots, 50$. The initial value of each node is generated uniformly at random in $[0, 1]$. We compare C-RBG with two state-of-the-art broadcast gossip algorithms for consensus. One is the broadcast gossip algorithm (BGA) [19], in which only one node broadcasts per slot, so there is no collision problem. The other is the collision broadcast gossip algorithm (CBGA) [21], in which each node broadcasts with probability $p$ in each slot. In both algorithms each node updates its value by (1) with a fixed mixing coefficient $\gamma$; we evaluate three values of the mixing coefficient, 0.2, 0.5 and 0.8, for BGA and CBGA.

We first present the convergence results of C-RBG, BGA and CBGA. For C-RBG, we set $q = \max_i d_i + 0.001$ for the random geometric network, and $q = \max_i d_i + 5$ for the tree and cycle networks. In all these cases, we set $p = 0.07$ for both C-RBG and CBGA, so that their communication complexity is comparable in expectation. The divergence of the whole network at slot $k$ is measured by the average node variance $\mathrm{Var}_k = N^{-1}\sum_{i=1}^N (x_i^k - \bar{x}^k)^2$. Fig. 3 shows the evolution of the variance versus slots in the different networks, where the y-axis is the average variance over 1000 simulation instances with independent random initial values. Fig. 3 shows that C-RBG converges at almost the same rate as CBGA; both converge far faster than BGA in late iterations, because C-RBG and CBGA allow multiple transmissions in a single slot.

[Fig. 3. Average variance of nodes versus time slots: (a) random geometric network, (b) tree network and (c) cycle network.]

Fig. 4 shows the average number of slots needed to reach the variance $10^{-3}$ for CBGA and C-RBG with various broadcast probabilities, again averaged over 1000 simulations. We notice that there is an optimal broadcast probability minimizing the number of iteration slots for C-RBG: with too small a $p$, the concurrency gain is insignificant; with too large a $p$, severe collisions degrade the performance gain of the algorithms.

Besides the speed of convergence, we are also concerned with the bias of the consensus results relative to the initial average. The bias at the $k$-th slot is defined as $\mathrm{Bias}_k = \sum_i x_i^k / N - \bar{x}^0$. The y-axis of Fig. 5 is the absolute value of the bias at slot $k = 500$, averaged over $10^4$ simulations from the same initial values. Evidently, in Fig. 5, the consensus results of CBGA are biased, except in the cycle network. The reason is that the cycle network is a regular graph: from (2), (9) and (10), both the collision probability and the mixing coefficient are then identical for all nodes, so CBGA becomes equivalent to C-RBG on regular topologies, and both achieve unbiased consensus. In the random geometric network and the tree network, node degrees differ, and so do the collision probabilities; yet CBGA applies the same mixing coefficient to all nodes, which biases CBGA, especially when the broadcast probability grows. For C-RBG, the bias is very small regardless of network topology and broadcast probability, which validates the unbiasedness of C-RBG in various topologies.

[Fig. 4. Average numbers of iterations needed to reach the variance $10^{-3}$: (a) random geometric network, (b) tree network, (c) cycle network.]

[Fig. 5. Absolute average bias of nodes versus broadcast probability $p$: (a) random geometric network, (b) tree network and (c) cycle network.]
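The two simulation metrics can be computed from a state trajectory as follows; this is a small helper of our own, matching the definitions of $\mathrm{Var}_k$ and $\mathrm{Bias}_k$ above, where `traj` is, e.g., the output of the `c_rbg` sketch in Section 2.

```python
import numpy as np

def variance_and_bias(traj):
    """Var_k = N^{-1} sum_i (x_i^k - xbar^k)^2 and
    Bias_k = sum_i x_i^k / N - xbar^0, per slot k,
    for a (num_slots + 1, N) trajectory array."""
    mean_k = traj.mean(axis=1)                        # xbar^k for each slot
    var_k = ((traj - mean_k[:, None]) ** 2).mean(axis=1)
    bias_k = mean_k - traj[0].mean()
    return var_k, bias_k
```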
4.2. DGD-RBG

As mentioned in the introduction, DGD-BGA [34] is the closest work to our algorithm and is chosen for comparison in this subsection. We use DGD-RBG and DGD-BGA to solve the same distributed convex optimization problem from [34]:
$$\min_{x \in \cap_i X_i} f(x) = \sum_{i=1}^N f_i(x),$$
where $x \in \mathbb{R}$, $X_i = [-1, 1]$ and $f_i(x) = s_i^2 x^2 - s_i x$; the $\{s_i\}$ are independent random variables drawn from $[-1, 1]$. The initial values $x_i^0$ are all set to zero. We set $q = \max_i d_i + 0.001$ in the random geometric network, $q = \max_i d_i + 5$ in the tree and cycle networks, and $p = 0.1$ in all the networks. Fig. 6 shows an example of the evolution of DGD-RBG in the random geometric network; it illustrates how the values fluctuate and tend towards a consensus state around the global optimal solution.

[Fig. 6. An evolution instance of all 50 nodes' values in the random geometric network for DGD-RBG.]

We compare the convergence rate of DGD-RBG with that of DGD-BGA [34]. The stepsize for DGD-BGA is set the same as that for DGD-RBG in Eq. (5). Define the target function error as $\mathrm{Err}_k = N^{-1}\sum_{i=1}^N [f(x_i^k) - f(x^*)]$, where $x^*$ denotes a global optimal solution of this convex optimization problem. The results are presented in Fig. 7, where the y-axis is the target function error averaged over 1000 simulation instances. DGD-RBG converges towards $x^*$ much faster than DGD-BGA in all the networks.

[Fig. 7. Average target function errors versus time slots: (a) random geometric network, (b) tree network and (c) cycle network.]
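A sketch of our own of this test problem in the callback form used by the `dgd_rbg_slot` sketch of Section 2; the seed is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(4)
N = 50
s = rng.uniform(-1.0, 1.0, size=N)      # i.i.d. s_i on [-1, 1]

def subgrad(i, x):
    # f_i(x) = s_i^2 x^2 - s_i x is differentiable, so the
    # subgradient reduces to the ordinary derivative.
    return 2.0 * s[i] ** 2 * x - s[i]

def project(i, x):
    # Projection onto X_i = [-1, 1].
    return float(np.clip(x, -1.0, 1.0))

def f(x):
    # Global objective f(x) = sum_i f_i(x), used for Err_k.
    return float(np.sum(s ** 2 * x ** 2 - s * x))
```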
5. Conclusions

We have presented two distributed algorithms, C-RBG and DGD-RBG, based on random broadcast gossip, which exploit the inherent broadcast property of wireless networks. Concurrent multiple transmissions in the network accelerate the convergence of both algorithms. Meanwhile, the separate mixing coefficient of each node keeps the
average invariant in mean, which is crucial to the correctness of DGD-RBG. We hope our work attracts more interest in broadcast-based in-network signal processing. Several points merit further theoretical study. Link failures or noise can be treated as in [18,3]. Transmission delay is also common in wireless networks and will substantially affect the network dynamics [17,11]. We would also like to investigate the performance of other stepsize rules, such as a fixed stepsize [34] and adaptive rules. The energy consumption can be evaluated more thoroughly in various topologies. Beyond subgradient descent, we hope to introduce more efficient local computation methods to accelerate the global convergence rate.
Acknowledgments

We thank the anonymous reviewers for their comments, which improved the quality of this paper. We also thank Mr. Yong Nie for his professional LaTeX typesetting work. This work was supported in part by the NSTMP of China (2012ZX03001007-003, 2013ZX03003006-003), the NSF of China (60972024), and the Ph.D. Programs Foundation of China (20120071110028).

Appendix A. Proof of Proposition 3

Proof. Rewrite (3) and (4) in column form with explicit column index $m$:
$$z_m^{k+1} = W^k z_m^k + s_m^k.$$
Here $s_m^k$ can be written componentwise as
$$s_{i,m}^k = \begin{cases} \bigl(P_{X_i}(v_i^k - \alpha_i^k g_i^k) - v_i^k\bigr)_m, & i \in Rx(k), \\ 0, & i \notin Rx(k), \end{cases}$$
where $Rx(k)$ is the index set of the successfully updating nodes in slot $k$. Recalling that $\xi_m^k = z_m^k - \mathbf{1}\bar{z}_m^k$, we have $\xi_m^{k+1} = B^k \xi_m^k + M s_m^k$, where $M = I - J$. It follows that
$$E[\|\xi_m^{k+1}\|_2] \le E[\|B^k \xi_m^k\|_2] + E[\|M s_m^k\|_2]. \qquad (\mathrm{A.1})$$
Since $\|M\|_2 \le 1$, by (A4) and Lemma 4, for any sufficiently large $k$ we have
$$E[\|M s_m^k\|_2] \le E[\|s_m^k\|_2] \le \Bigl[\sum_{i \in Rx(k)} (\alpha_i^k g_{i,m}^k)^2\Bigr]^{1/2} \le \frac{NC}{(k-\tilde{k})\eta_{\min}}, \qquad (\mathrm{A.2})$$
where $g_{i,m}^k$ is the $m$-th component of $g_i^k$.
For any $\xi_m^k \in S$, we have
$$E[\|B^k \xi_m^k\|_2] \le \bigl(E[\|B^k \xi_m^k\|_2^2]\bigr)^{1/2} \le \sqrt{\lambda_{1|S}}\, \|\xi_m^k\|_2, \qquad (\mathrm{A.3})$$
where $\lambda_{1|S}$ denotes $\lambda_{1|S}(E[(B^k)^T B^k])$ for brevity. Inserting (A.2) and (A.3) into (A.1), we have
$$E[\|\xi_m^{k+1}\|_2] \le \sqrt{\lambda_{1|S}}\, \|\xi_m^k\|_2 + \frac{NC}{(k-\tilde{k})\eta_{\min}}.$$
Then we can construct a supermartingale as follows:
$$E\Bigl[\frac{1}{k+1}\|\xi_m^{k+1}\|_2\Bigr] \le \frac{1}{k}\|\xi_m^k\|_2 - \frac{1-\sqrt{\lambda_{1|S}}}{k+1}\|\xi_m^k\|_2 + \frac{NC}{k(k-\tilde{k})\eta_{\min}},$$
where $k \ge K$ as required by Lemma 4. Since $\sum_k NC/(k(k-\tilde{k})\eta_{\min}) < \infty$, it follows from Lemma 7 that
$$\sum_k \frac{1-\sqrt{\lambda_{1|S}}}{k+1}\|\xi_m^k\|_2 < \infty,$$
which implies that with probability 1,
$$\lim_{k\to\infty} \|\xi_m^k\|_2 = 0. \qquad (\mathrm{A.4})$$
The result (A.4) implies that $\lim_{k\to\infty} z_m^k = c\mathbf{1}$ a.s., which holds for every component index $m$, so part (1) is proved. Part (2) follows since $\lim_{k\to\infty} v_m^k = \lim_{k\to\infty} W^k z_m^k = \lim_{k\to\infty} z_m^k$. □

Appendix B. Proof of Theorem 2
Proof. Without loss of generality, consider node $i$. For any $x^* \in X^* \subseteq X_i$,
$$\|x_i^{k+1} - x^*\|_2^2 = \|P_{X_i}(v_i^k - \alpha_i^k g_i^k) - x^*\|_2^2 \le \|v_i^k - \alpha_i^k g_i^k - x^*\|_2^2,$$
where we use the property of projection onto a convex set (Prop. 2.2.1 in [38]). By the subgradient property (Prop. 4.4.2 in [34]),
$$(g_i^k)^T (v_i^k - x^*) \ge f_i(\bar{x}^k) - f_i(x^*) - C\|v_i^k - \bar{x}^k\|_2.$$
It follows that
$$\|x_i^{k+1} - x^*\|_2^2 \le \|v_i^k - x^*\|_2^2 + (\alpha_i^k C)^2 - 2\alpha_i^k\bigl(f_i(\bar{x}^k) - f_i(x^*)\bigr) + 2\alpha_i^k C\|v_i^k - \bar{x}^k\|_2.$$
Summing over all nodes and taking expectations, for sufficiently large $k$ we have
$$E\Bigl[\sum_i \|x_i^{k+1} - x^*\|_2^2 \,\Big|\, x^k\Bigr] \le \sum_i \|x_i^k - x^*\|_2^2 - 2\,\frac{f(\bar{x}^k) - f(x^*)}{(k+\tilde{k})\eta_{\max}} + \Bigl(\frac{NC}{(k-\tilde{k})\eta_{\min}}\Bigr)^2 + \frac{2CN\, E[\|v_i^k - \bar{x}^k\|_2]}{(k-\tilde{k})\eta_{\min}},$$
where we use Lemmas 4 and 6. From Lemma 7, we have $\sum_{k=0}^\infty E[\|v_i^k - \bar{x}^k\|_2]/((k-\tilde{k})\eta_{\min}) < \infty$. Combined with $\sum_{k=0}^\infty \bigl(NC/((k-\tilde{k})\eta_{\min})\bigr)^2 < \infty$, this implies that with probability 1,
$$\sum_{k=0}^\infty \frac{f(\bar{x}^k) - f(x^*)}{(k+\tilde{k})\eta_{\max}} < \infty,$$
which implies that $\lim_{k\to\infty} f(\bar{x}^k) = f(x^*)$, and that the sequence $\sum_i \|x_i^k - x^*\|_2^2$ converges to a nonnegative random variable a.s. By Proposition 3, we then have $\lim_{k\to\infty} f(x_i^k) = f(x^*)$, which implies that the sequence $\{x_i^k\}$ converges to the optimal set $X^*$. Since the convergence of $\{\sum_i \|x_i^k - x^*\|_2^2\}$ holds for any $x^* \in X^*$, the sequence $\{x_i^k\}$ generated by DGD-RBG must converge to some $x^*$ of problem (P1) a.s. □

References

[1] I.F. Akyildiz, W. Su, Y. Sankarasubramaniam, E. Cayirci, Wireless sensor networks: a survey, Comput. Netw. 38 (4) (2002) 393-422.
[2] A. Giridhar, P. Kumar, Toward a theory of in-network computation in wireless sensor networks, IEEE Commun. Mag. 44 (2006) 98-107.
[3] I. Schizas, A. Ribeiro, G. Giannakis, Consensus in ad hoc WSNs with noisy links - part I: distributed estimation of deterministic signals, IEEE Trans. Signal Process. 56 (2008) 350-364.
[4] I. Schizas, G. Giannakis, S. Roumeliotis, A. Ribeiro, Consensus in ad hoc WSNs with noisy links - part II: distributed estimation and smoothing of random signals, IEEE Trans. Signal Process. 56 (2008) 1650-1666.
[5] G. Mateos, I. Schizas, G. Giannakis, Distributed recursive least-squares for consensus-based in-network adaptive estimation, IEEE Trans. Signal Process. 57 (2009) 4583-4588.
[6] I. Schizas, G. Mateos, G. Giannakis, Distributed LMS for consensus-based in-network adaptive processing, IEEE Trans. Signal Process. 57 (2009) 2365-2382.
[7] F. Xi, J. He, Z. Liu, Adaptive fast consensus algorithm for distributed sensor fusion, Signal Process. 90 (5) (2010) 1693-1699.
[8] P.A. Forero, A. Cano, G.B. Giannakis, Consensus-based distributed support vector machines, J. Mach. Learn. Res. 11 (2010) 1633-1707.
[9] J.F.C. Mota, J.M.F. Xavier, P.M.Q. Aguiar, M. Puschel, Distributed basis pursuit, IEEE Trans. Signal Process. 60 (4) (2012) 1942-1956.
[10] X. Wang, L. Cheng, Z.-Q. Cao, C. Zhou, M. Tan, Z.-G. Hou, Output-feedback consensus control of linear multi-agent systems: a fixed topology, Int. J. Innov. Comput. Inf. Control 7 (5(A)) (2011) 2063-2074.
[11] R. Yang, P. Shi, G.-P. Liu, H. Gao, Network-based feedback control for systems with mixed delays based on quantization and dropout compensation, Automatica 47 (12) (2011) 2805-2809.
[12] Y. Wakasa, K. Tanaka, Y. Nishimura, Distributed output consensus via LMI-based model predictive control and dual decomposition, Int. J. Innov. Comput. Inf. Control 7 (10) (2011) 5801-5812.
[13] X. Su, L. Wu, P. Shi, Sensor networks with random link failures: distributed filtering for T-S fuzzy systems, IEEE Trans. Ind. Inf. 9 (3) (2013) 1739-1750.
[14] W. Ren, R.W. Beard, E.M. Atkins, A survey of consensus problems in multi-agent coordination, in: Proceedings of the 2005 American Control Conference (ACC 2005), 2005, pp. 1859-1864.
[15] R. Olfati-Saber, J.A. Fax, R.M. Murray, Consensus and cooperation in networked multi-agent systems, Proc. IEEE 95 (1) (2007) 215-233.
[16] D. Shah, Gossip algorithms, Found. Trends Netw. 3 (1) (2009) 1-125.
[17] R. Olfati-Saber, R.M. Murray, Consensus problems in networks of agents with switching topology and time-delays, IEEE Trans. Autom. Control 49 (9) (2004) 1520-1533.
[18] F. Fagnani, S. Zampieri, Randomized consensus algorithms over large scale networks, IEEE J. Sel. Areas Commun. 26 (4) (2008) 634-649.
[19] T.C. Aysal, M.E. Yildiz, A.D. Sarwate, A. Scaglione, Broadcast gossip algorithms for consensus, IEEE Trans. Signal Process. 57 (7) (2009) 2748-2761.
[20] S. Boyd, A. Ghosh, B. Prabhakar, D. Shah, Randomized gossip algorithms, IEEE Trans. Inf. Theory 52 (6) (2006) 2508-2530.
[21] F. Fagnani, P. Frasca, Broadcast gossip averaging: interference and unbiasedness in large Abelian Cayley networks, IEEE J. Sel. Top. Signal Process. 5 (4) (2011) 866-875.
[22] A. Nedic, D.P. Bertsekas, Incremental subgradient methods for nondifferentiable optimization, SIAM J. Optim. 12 (2001) 109-138.
[23] T. Erseghe, D. Zennaro, E. Dall'Anese, L. Vangelista, Fast consensus by the alternating direction multipliers method, IEEE Trans. Signal Process. 59 (11) (2011) 5523-5537.
[24] S. Kumar, F. Zhao, D. Shepherd, Collaborative signal and information processing in microsensor networks, IEEE Signal Process. Mag. 19 (2) (2002) 13-14.
[25] J.B. Predd, S.B. Kulkarni, H.V. Poor, Distributed learning in wireless sensor networks, IEEE Signal Process. Mag. 23 (2006) 56-69.
[26] A. Nedic, A. Ozdaglar, Distributed subgradient methods for multi-agent optimization, IEEE Trans. Autom. Control 54 (1) (2009) 48-61.
[27] M. Rabbat, R. Nowak, Distributed optimization in sensor networks, in: Proceedings of IPSN, 2004, pp. 20-27.
[28] Q. Zhao, A. Swami, L. Tong, The interplay between signal processing and networking in sensor networks, IEEE Signal Process. Mag. 23 (4) (2006) 84-93.
[29] D.P. Bertsekas, J.N. Tsitsiklis, Parallel and Distributed Computation: Numerical Methods, Prentice Hall, NJ, 1989.
[30] S. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein, Distributed optimization and statistical learning via the alternating direction method of multipliers, Found. Trends Mach. Learn. 3 (1) (2011) 1-122.
[31] E.S.H. Neto, A.R.D. Pierro, Incremental subgradients for constrained convex optimization: a unified framework and new methods, SIAM J. Optim. 20 (2009) 1547-1572.
[32] L. Li, J.A. Chambers, A new incremental affine projection-based adaptive algorithm for distributed networks, Signal Process. 88 (10) (2008) 2599-2603.
[33] A. Nedic, A. Ozdaglar, P.A. Parrilo, Constrained consensus and optimization in multi-agent networks, IEEE Trans. Autom. Control 55 (4) (2010) 922-938.
[34] A. Nedic, Asynchronous broadcast-based convex optimization over a network, IEEE Trans. Autom. Control 56 (6) (2011) 1337-1351.
[35] M. Mitzenmacher, E. Upfal, Probability and Computing: Randomized Algorithms and Probabilistic Analysis, Cambridge University Press, 2005.
[36] J. Zhou, Q. Wang, Convergence speed in distributed consensus over dynamically switching random networks, Automatica 45 (6) (2009) 1455-1461.
[37] D.P. Bertsekas, J.N. Tsitsiklis, Neuro-Dynamic Programming, Athena Scientific, 1996.
[38] D.P. Bertsekas, A. Nedic, A.E. Ozdaglar, Convex Analysis and Optimization, Athena Scientific, 2003.