Signal Processing 90 (2010) 1197–1208
Distributed variational Bayesian algorithms for Gaussian mixtures in sensor networks

B. Safarinejadian, M.B. Menhaj, M. Karrari

Department of Electrical Engineering, Amirkabir University of Technology, 424 Hafez Ave., Tehran, Iran
Article history: Received 12 February 2009; received in revised form 26 September 2009; accepted 10 October 2009; available online 22 October 2009.

Keywords: Sensor networks; Clustering; Density estimation; Mixture of Gaussians; Variational approximations

Abstract

This paper presents a distributed variational Bayesian algorithm (DVBA) for density estimation and clustering in sensor networks. It is assumed that the measurements of the sensors can be statistically modeled by a common Gaussian mixture model. The variational approach allows the simultaneous estimation of the component parameters and the model complexity. The DVBA produces an estimate of the density of the sensor data without requiring the data to be transmitted to and processed at a central location. Alternatively, DVBA can be viewed as a distributed processing approach for clustering the sensor data into components corresponding to predominant environmental features sensed by the network. The convergence of the proposed DVBA is also investigated. The proposed method is then applied to environmental monitoring and to distributed target classification. Simulation results confirm the promising performance of this algorithm.
1. Introduction

Sensor networks consist of massively distributed devices that have some sensing and communication capabilities. They have a broad range of environmental sensing applications, including environmental monitoring, target tracking, collaborative processing of information, and gathering data from spatially distributed sources [1,2]. Sensor networks can answer queries about the environment. Density estimation and unsupervised clustering are central first steps in exploratory data analysis [3,4]. Statistical learning can be used for density estimation and unsupervised learning. Due to recent advances in sensor networks and other network-based structures, such as computer networks on the internet, the need for efficient ways to deal with large amounts of data that are distributed over a set of nodes
has been highlighted. In particular, in many data mining applications we are interested in learning a global model from such data, such as a probability distribution or a clustering of the data, without transferring all the data to a central unit [5]. Therefore, distributed data mining (DDM) has recently emerged as an extremely important area of research. The objective here is to perform data mining tasks (such as association rule mining, clustering, and classification) on a distributed database, that is, a database distributed across several sites (nodes) connected by a network. Research in this field aims at mining information from such databases while minimizing the amount of communication between nodes. For example, an algorithm for distributed association rule mining in peer-to-peer systems has been presented in [6]. K-means clustering has been extended to the distributed scenario in [5]. Techniques for performing non-parametric density estimation over homogeneously distributed data have been studied and experimentally evaluated in [7]. Recently, sensor networks have emerged as important systems in which distributed data mining has a crucial role. Sensor networks may be used for spatial data
analysis. Unknown environmental functions, such as temperature, air pressure, humidity, or light, can be approximated by using learning-theory-based approaches. For example, in [8,9], a Gaussian kernel function was adopted and supervised learning was used for function approximation. These approaches can be used for estimating unknown environmental functions (temperature, air pressure, humidity, or light), or for tracking purposes, such as tracking a moving object. The applicability of graphical models and belief propagation to distributed inference in sensor networks has also been studied in [10,11]. The underlying probability distribution is represented as a graph, and inference proceeds as a sequence of message-passing operations between nodes of the graph. In [11], such a belief propagation based inference mechanism has been applied to multivariate Gaussian distributions. The EM (expectation maximization) algorithm [12,13] is another inference approach that has been used for density estimation and data clustering. The EM algorithm is a method of density estimation in which some of the variables are assumed to be missing or unobservable. Recently, there has been some research on distributed density estimation using the EM algorithm. A distributed EM algorithm has been developed in [14] for density estimation in sensor networks, assuming that the measurements are statistically modeled by a mixture of Gaussians. In [15], a gossip-based distributed EM algorithm named Newscast EM has been proposed for Gaussian mixture learning, in which the E and M steps of the EM algorithm are first performed locally. The global estimates of the means and covariances are then obtained through a gossip-based randomized method. In [16], another distributed EM algorithm was proposed for Gaussian mixtures in sensor networks, in which a consensus filter is used to diffuse local sufficient statistics to neighbors and estimate global sufficient statistics at each node. An important problem with the EM algorithm is that singularities may arise in the estimated parameters; in particular, if the model order is not properly selected and the assumed order is greater than the true order of the observed data, singularity is inevitable. Furthermore, the model order is assumed to be known in these methods, which is a significant limitation. Recently, variational methods have gained popularity in the machine learning literature and have also been used to estimate the parameters of finite mixture models. The variational Bayesian method aims to construct a tight lower bound on the data marginal likelihood and then seeks to optimize this bound using an iterative scheme [17–22]. Furthermore, using variational methods, the order of the mixture model may also be estimated. In other words, the variational approach allows the simultaneous estimation of the component parameters and the model complexity.

This paper proposes a new framework for distributed data exploration in sensor networks. Density estimation and unsupervised clustering are central first steps in exploratory data analysis. Both problems can also be naturally posed as maximum likelihood estimation problems, and have been widely studied under the assumption that data are stored and processed at a central location. This assumption is not used in this paper; instead, we assume that the data are not centralized, but rather are distributed across a collection of networked devices. Moreover, it is assumed that the cost (in terms of power or related resources) of computation at each node is much less than that of communication between nodes, which makes the option of centralized data processing very expensive and unattractive. In this paper, it is assumed that the measurements of the sensors are statistically modeled by a common mixture of Gaussians. A distributed variational Bayesian algorithm (DVBA) is then developed for estimating the model complexity and the Gaussian component parameters simultaneously. The proposed DVBA can also be used as a general distributed data mining algorithm for density estimation and clustering of data distributed over the nodes of a network. The DVBA can be viewed as a distributed implementation of an incremental variational approximation, which is a fast alternative to the standard variational approximation for mixture density estimation; therefore its convergence rate is superior to that of the standard variational Bayesian algorithm. The proposed DVBA implements the variational approximation in a distributed structure, and it requires only a modest communication cost. Convergence properties of the DVBA are also studied here, and it is shown that the DVBA converges to a maximum likelihood point. Finally, to demonstrate the performance of the proposed DVBA, a practical problem (multi-camera target tracking) is considered in the paper. Given that, with the increasing use of camera surveillance in public areas, the need for automated surveillance systems keeps growing, this benchmark study helps highlight the merit of the proposed distributed algorithm.

The rest of the paper is organized as follows. In Section 2, the basic problem statement is given, and the observations and data models are defined. Section 3 reviews the variational approximations for mixture density estimation, and a distributed algorithm for their implementation is presented. Section 4 develops a fully distributed variational Bayesian algorithm that aims to reduce the number of iterations, and hence the communications, required to compute the maximum likelihood estimate. Section 5 is devoted to the convergence analysis of the DVBA. In Section 6, the proposed algorithm is used for environmental monitoring and also for distributed target classification. Finally, Section 7 concludes the paper.

2. Problem statement

Consider a network of M nodes and a d-dimensional random vector $Y_m$ with probability density function $f(y_m)$ corresponding to node m. Each data point (measurement) $y_{m,i}$ of node m is a realization of the random vector $Y_m$. Assume that the distribution of the measurements is represented by a finite mixture of components:
$$f(y_m; \pi_m, \theta) = \sum_{j=1}^{J} \pi_{m,j}\, f_j(y_m; \theta_j) \qquad (1)$$
where $\pi_m = \{\pi_{m,j}\}_{j=1}^{J}$ are the mixture probabilities at node m, $\theta_j$ is the set of parameters defining the jth component, and J is the number of mixture components. The mixture probabilities $\pi_m$ may differ from node to node, while the parameters $\theta_j$ are the same throughout the network. Here, we define $\theta = \{\theta_j\}_{j=1}^{J}$ and $\Theta = \theta \cup \{\pi_m\}_{m=1}^{M}$. The set of data points of the mth node is represented by $Y_m = \{y_{m,i}\}_{i=1}^{N_m}$, in which $N_m$ is the number of observations at node m. It is assumed that the measurements of each node are independent and identically distributed. Consider a set of missing variables $Z_m = \{z_{m,i}\}_{i=1}^{N_m}$ corresponding to $Y_m = \{y_{m,i}\}_{i=1}^{N_m}$. Each $z_{m,i} = [z_{m,i}^1, \ldots, z_{m,i}^J]$ is a binary vector indicating by which component the data point $y_{m,i}$ was produced: we say that $y_{m,i}$ was produced by the jth component of the mixture if $z_{m,i}^r = 0$ for all $r \neq j$ and $z_{m,i}^j = 1$. The pair $x_{m,i} = (y_{m,i}, z_{m,i})$ is regarded as the complete data, and we write $X_m = \{Y_m, Z_m\}$, in which $X_m = \{x_{m,i}\}_{i=1}^{N_m}$. While our approach can be applied to arbitrary models, for simplicity we consider here Gaussian component distributions, $f_j(y_m; \theta_j) = \mathcal{N}(y_m; \mu_j, T_j)$, where $\mu_j$ is the mean and $T_j$ the precision (inverse covariance) matrix. In this paper, we develop a distributed variational Bayesian algorithm to estimate the parameters of this Gaussian mixture model using the data set $Y = \{Y_m\}_{m=1}^{M}$. The introduced variables are defined briefly in Table 1.
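As a concrete illustration of this data model, the following minimal Python sketch draws node data from a common Gaussian mixture with node-specific mixing weights. All sizes and parameter values here are illustrative assumptions, not settings taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes and parameters (assumed for this sketch only).
M, N_m, J = 5, 100, 3
mus = np.array([[0.0, -2.0], [0.0, 0.0], [0.0, 2.0]])   # common component means mu_j
covs = np.stack([np.diag([2.0, 0.2])] * J)              # common covariances (T_j is the inverse)
pis = rng.dirichlet(np.ones(J), size=M)                 # node-specific mixing weights pi_m

def sample_node(m, n=N_m):
    """Draw n observations y_{m,i} of node m together with the indicators z_{m,i}."""
    z = rng.choice(J, size=n, p=pis[m])                 # missing variables: component labels
    y = np.array([rng.multivariate_normal(mus[j], covs[j]) for j in z])
    return y, z

Y = [sample_node(m)[0] for m in range(M)]               # Y = {Y_1, ..., Y_M}
```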
3. The variational Bayesian inference approach

Some variational methods for inference about mixture models have recently been proposed in the machine learning literature. A Bayesian framework based on the variational approximation was proposed in [23] for estimating the parameters and hyperparameters of a mixture model as an alternative to the maximum likelihood approach. In [17], the variational Bayes technique was extended to perform model selection as well as parameter estimation by introducing a prior over the model structure. For mixture models this leads to a posterior distribution over the number of components in the model. In [23,17] the connection between the EM algorithm and the variational Bayes algorithm was emphasized. Variational Bayes is an EM-like algorithm: the expectation step of EM corresponds to finding the expected value of the posterior of the component indicator variables in variational Bayes. The maximization step of EM relates to estimating the model parameters in
variational Bayes by maximizing the lower bound on the marginal log-likelihood. In [21], variational methods were applied to the Bayesian analysis of mixtures of Gaussian distributions. The above-mentioned methods are all based on centralized data analysis, while we propose distributed approaches for density estimation using the measurements of a sensor network. In the following, the variational approach is first introduced, and it is then shown how it can be used for estimating the parameters of a Gaussian mixture model. Assuming that the measurements are obtained through a sensor network, a two-pass distributed variational Bayesian algorithm is also developed for estimating the mixture parameters.
3.1. The variational Bayesian algorithm

Suppose we have observed the data set Y. Here, a parametric model with parameters $\theta$ is assumed, and missing or unobserved values are denoted by Z. Of interest is the posterior distribution of $\theta$ given Y. The idea of the variational approximation is to approximate the joint conditional density of $\theta$ and Z by a more tractable distribution $\hat{f}_{Z,\theta}(z,\theta)$, by minimizing the Kullback–Leibler (KL) divergence between the approximating density $\hat{f}_{Z,\theta}(z,\theta)$ and the true joint conditional density $f_{Z,\theta|Y}(z,\theta|y)$. The motivation for this is that we wish to obtain a tight lower bound on the log-evidence $L = \log f_Y(y)$. We can establish a lower bound for L as follows:

$$\log f_Y(y) = \log \int \sum_{z} f_{Y,Z,\theta}(y,z,\theta)\, d\theta = \log \int \sum_{z} \hat{f}_{Z,\theta}(z,\theta)\, \frac{f_{Y,Z,\theta}(y,z,\theta)}{\hat{f}_{Z,\theta}(z,\theta)}\, d\theta \ge \int \sum_{z} \hat{f}_{Z,\theta}(z,\theta) \log \frac{f_{Y,Z,\theta}(y,z,\theta)}{\hat{f}_{Z,\theta}(z,\theta)}\, d\theta \qquad (2)$$
The last term was obtained by Jensen's inequality. This lower bound, which is denoted by $\mathcal{F}$, represents the variational free energy. The difference between the right- and left-hand sides of Eq. (2) is the KL divergence, given by

$$\mathrm{KL}\big[\hat{f}_{\theta,Z}(\theta,z)\,\big\|\, f_{\theta,Z|Y}(\theta,z|y)\big] = \int \sum_{z} \hat{f}_{\theta,Z}(\theta,z) \log \frac{\hat{f}_{\theta,Z}(\theta,z)}{f_{\theta,Z|Y}(\theta,z|y)}\, d\theta$$
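The relation behind Eq. (2), namely $\log f_Y(y) = \mathcal{F} + \mathrm{KL} \ge \mathcal{F}$, can be checked numerically on a toy model in which both $\theta$ and Z are discrete, so that every quantity is an exact finite sum. The sketch below uses arbitrary illustrative numbers and is only meant to make the decomposition tangible.

```python
import numpy as np

# Toy fully discrete model: theta and z each take two values, y is fixed/observed.
# joint[z, t] = f(y, z, theta=t) for the observed y.
joint = np.array([[0.20, 0.05],
                  [0.10, 0.15]])

log_evidence = np.log(joint.sum())                     # L = log f(y)

# An arbitrary factorized approximation f_hat(z, theta) = q_z(z) * q_t(theta).
q_z = np.array([0.6, 0.4])
q_t = np.array([0.7, 0.3])
q = np.outer(q_z, q_t)

free_energy = np.sum(q * (np.log(joint) - np.log(q)))  # F, the lower bound of Eq. (2)
posterior = joint / joint.sum()                        # true joint conditional f_{Z,theta|Y}
kl = np.sum(q * (np.log(q) - np.log(posterior)))       # KL[f_hat || posterior]

assert np.isclose(log_evidence, free_energy + kl)      # L = F + KL, hence F <= L
```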
Table 1
List of notation.

$M$: number of nodes of the network
$Y_m$: a d-dimensional random vector corresponding to node m
$y_{m,i}$: a data point (measurement) of node m
$Y_m = \{y_{m,i}\}_{i=1}^{N_m}$: the set of data points of node m
$Y = \{Y_m\}_{m=1}^{M}$: the set of all data points
$Z_m = \{z_{m,i}\}_{i=1}^{N_m}$: the set of missing variables corresponding to $Y_m$
$z_{m,i} = [z_{m,i}^1, \ldots, z_{m,i}^J]$: a binary vector indicating by which component the data point $y_{m,i}$ was produced
$\pi_m = \{\pi_{m,j}\}_{j=1}^{J}$: the mixture probabilities at node m
$\theta_j$: the set of parameters defining the jth component
$\theta = \{\theta_j\}_{j=1}^{J}$: the set of parameters defining all of the components
$\Theta = \theta \cup \{\pi_m\}_{m=1}^{M}$: the set of all parameters defining the mixture model
$\mathcal{N}(y; \mu, T)$: a Gaussian distribution with mean $\mu$ and precision (inverse covariance) matrix $T$
Since

$$\log f_Y(y) = \int \sum_{z} \hat{f}_{\theta,Z}(\theta,z) \log \frac{f_{Y,Z,\theta}(y,z,\theta)}{\hat{f}_{\theta,Z}(\theta,z)}\, d\theta + \mathrm{KL}\big[\hat{f}_{\theta,Z}(\theta,z)\,\big\|\, f_{\theta,Z|Y}(\theta,z|y)\big]$$

we want the lower bound to be as close as possible to L and, clearly, because of the KL divergence, maximizing the lower bound (2) corresponds to minimizing the KL divergence. The KL divergence is actually minimized, with value zero, if we take $\hat{f}_{\theta,Z}(\theta,z) = f_{\theta,Z|Y}(\theta,z|y)$, but this does not simplify the problem. We require a $\hat{f}_{\theta,Z}(\theta,z)$ which provides a close approximation to the true joint conditional density and yet is simple enough to be computed. Usually $\hat{f}_{\theta,Z}(\theta,z)$ is restricted to have the factorized form $\hat{f}_{\theta,Z}(\theta,z) = \hat{f}_{\theta}(\theta)\hat{f}_Z(z)$. The factors are chosen to minimize the KL divergence. The variational Bayes (VB) algorithm maximizes the negative free energy $\mathcal{F}$ via an EM-like strategy (see [17]). In the E-step, we compute the posterior over the hidden variables by solving $\partial\mathcal{F}/\partial\hat{f}_Z = 0$ to get

$$\hat{f}_Z(z) \propto \exp[I(z)] \qquad (3)$$

where $I(z) = E_{\theta}[\log f_{Y,Z|\theta}(y,z|\theta)]$ and the expectation is taken w.r.t. $\hat{f}_{\theta}(\theta)$. In the M-step, rather than the optimal parameters, we compute the posterior distribution over the parameters by solving $\partial\mathcal{F}/\partial\hat{f}_{\theta} = 0$ to get

$$\hat{f}_{\theta}(\theta) \propto \exp[I(\theta)]\, f_{\theta}(\theta) \qquad (4)$$

where $I(\theta) = E_z[\log f_{Y,Z|\theta}(y,z|\theta)]$ and the expectation is taken w.r.t. $\hat{f}_Z(z)$. This is where the concept of conjugate priors becomes useful. We choose the prior $f_{\theta}(\theta)$ from a family of distributions such that $\hat{f}_{\theta}(\theta) \propto \exp[I(\theta)]\, f_{\theta}(\theta)$ belongs to the same family; $f_{\theta}(\theta)$ is then said to be conjugate to $\exp[I(\theta)]$. In this case, learning in the VB framework simply amounts to transforming the prior parameters into the posterior parameters.

Penalizing complex models: To see that the VB objective function $\mathcal{F}$ penalizes complexity, it is useful to rewrite it as

$$\mathcal{F} = E_{Z,\theta}\left[\log \frac{f_{Y,Z|\theta}(y,z|\theta)}{\hat{f}_Z(z)}\right] - \mathrm{KL}\big[\hat{f}_{\theta}(\theta)\,\big\|\, f_{\theta}(\theta)\big]$$

where the average in the first term on the right-hand side is taken w.r.t. $\hat{f}_{Z,\theta}(z,\theta)$. The first term corresponds to an averaged likelihood, and the second term is the KL distance between the prior and the posterior over the parameters. As the number of parameters increases, the KL distance grows and consequently reduces $\mathcal{F}$. This penalized likelihood interpretation becomes transparent in the large-sample limit $N \to \infty$, where the parameter posterior is sharply peaked about the most probable value $\theta = \hat{\theta}$. It can then be shown that the KL penalty reduces to $(p/2)\log N$, which is linear in the number of parameters p. $\mathcal{F}$ then corresponds precisely to the Bayesian information criterion (BIC) and the minimum description length (MDL) (see [18]). Thus, these popular model order selection criteria can be seen as a limiting case of the VB framework.

3.2. Application of the variational method to mixtures of Gaussian distributions

Assume that the observations of a sensor network are represented by the Gaussian mixture distribution introduced in Eq. (1). This section is devoted to applying the variational Bayesian method to estimate the model complexity and the parameters of this model. Here, conjugate priors are assigned to the parameters $\pi$ and $\theta$. The mixing weights are assigned a Dirichlet prior distribution

$$f_{\pi_m}(\pi_m) = \mathrm{Dir}(\pi_m; \alpha^0_{m,1}, \ldots, \alpha^0_{m,J})$$

The means are assigned independent multivariate normal conjugate priors, conditional on the covariance matrices, as

$$f_{\mu|T}(\mu|T) = \prod_{j=1}^{J} \mathcal{N}\big(\mu_j; m^0_j, (\beta^0_j T_j)^{-1}\big)$$

where $\mu = (\mu_1, \ldots, \mu_J)$ and $T = (T_1, \ldots, T_J)$. The precision matrices are given independent Wishart prior distributions,

$$f_T(T) = \prod_{j=1}^{J} \mathcal{W}(T_j; u^0_j, S^0_j)$$

Thus, the joint distribution of all of the random variables is given by

$$f_{Y,Z,\theta}(y,z,\theta) = f_{Y,Z|\theta}(y,z|\theta)\, f_{\pi}(\pi)\, f_{\mu|T}(\mu|T)\, f_T(T)$$

The quantities $\alpha^0_j$, $m^0_j$, $\beta^0_j$, $u^0_j$ and $S^0_j$ are called hyperparameters. For the variational approximation of $f_{\theta,Z|Y}(\theta,z|y)$, we take $\hat{f}_{\theta,Z}(\theta,z)$ to have the factorized form $\hat{f}_{\theta,Z}(\theta,z) = \hat{f}_{\theta}(\theta)\hat{f}_Z(z)$. The posterior distributions obtained using Eqs. (3) and (4) are

$$\hat{f}_{\pi_m}(\pi_m) = \mathrm{Dir}(\pi_m; \alpha_{m,1}, \ldots, \alpha_{m,J})$$

$$\hat{f}_{\mu|T}(\mu|T) = \prod_{j=1}^{J} \mathcal{N}\big(\mu_j; m_j, (\beta_j T_j)^{-1}\big)$$

$$\hat{f}_T(T) = \prod_{j=1}^{J} \mathcal{W}(T_j; u_j, S_j)$$

with hyperparameters $\Gamma = \{\alpha_j, m_j, \beta_j, u_j, S_j\}_{j=1}^{J}$ given by

$$\alpha_{m,j} = \alpha^{(0)}_{m,j} + \sum_{m=1}^{M}\sum_{i=1}^{N_m} q_{m,i,j} \qquad (5)$$

$$\beta_j = \beta^{(0)}_j + \sum_{m=1}^{M}\sum_{i=1}^{N_m} q_{m,i,j} \qquad (6)$$
$$m_j = \frac{\beta^{(0)}_j m^{(0)}_j + \sum_{m=1}^{M}\sum_{i=1}^{N_m} q_{m,i,j}\, y_{m,i}}{\beta_j} \qquad (7)$$

$$S_j = S^{(0)}_j + \sum_{m=1}^{M}\sum_{i=1}^{N_m} q_{m,i,j}\, y_{m,i} y_{m,i}^T + \beta^{(0)}_j m^{(0)}_j m^{(0)T}_j - \beta_j m_j m_j^T \qquad (8)$$

$$u_j = u^{(0)}_j + \sum_{m=1}^{M}\sum_{i=1}^{N_m} q_{m,i,j} \qquad (9)$$

where

$$q^t_{m,i,j} = \frac{\varphi^t_{m,i,j}}{\sum_{j=1}^{J} \varphi^t_{m,i,j}} \qquad (10)$$

$$\varphi^t_{m,i,j} = \tilde{\pi}^{t-1}_j \big(\tilde{T}^{t-1}_j\big)^{1/2} \exp\!\left(-\tfrac{1}{2}(y_{m,i} - m^{t-1}_j)^T E[T^{t-1}_j](y_{m,i} - m^{t-1}_j) - \frac{d}{2\beta^{t-1}_j}\right) \qquad (11)$$

$$\tilde{\pi}_j = \exp\!\left[\int \hat{f}(\pi) \log \pi_j\, d\pi\right] \qquad (12)$$

$$\tilde{T}_j = \exp\!\left[\int \hat{f}(T_j) \log |T_j|\, dT_j\right] \qquad (13)$$

$$E[T_j] = u_j S_j^{-1} \qquad (14)$$

Eqs. (5)–(9) can also be rewritten in a compact form as

$$\alpha^t_{j,m} = \alpha^0_j + q^t_j \qquad (15)$$

$$\beta^t_j = \beta^0_j + q^t_j \qquad (16)$$

$$m^t_j = \frac{\beta^0_j m^0_j + a^t_j}{\beta^t_j} \qquad (17)$$

$$S^t_j = S^0_j + b^t_j + \beta^0_j m^0_j m^{0T}_j - \beta^t_j m^t_j m^{tT}_j \qquad (18)$$

$$u^t_j = u^0_j + q^t_j \qquad (19)$$

in which the local sufficient statistics (or local summary quantities) are defined as

$$q^t_{m,j} = \frac{1}{N_m}\sum_{i=1}^{N_m} q^t_{m,i,j} \qquad (20)$$

$$a^t_{m,j} = \sum_{i=1}^{N_m} q^t_{m,i,j}\, y_{m,i} \qquad (21)$$

$$b^t_{m,j} = \sum_{i=1}^{N_m} q^t_{m,i,j}\, y_{m,i} y_{m,i}^T \qquad (22)$$

The global sufficient statistics are also defined as

$$q^t_j = \sum_{m=1}^{M} N_m\, q^t_{j,m} \qquad (23)$$

$$b^t_j = \sum_{m=1}^{M} b^t_{j,m} \qquad (24)$$

$$a^t_j = \sum_{m=1}^{M} a^t_{j,m} \qquad (25)$$

The variational approximation method leads to an automatic choice of model complexity as follows. The algorithm is first initialized with a number of components larger than one would expect to find. Then the iterations are executed until convergence is reached. In this case, if two or more Gaussians with similar parameters seem to be representing the same component of the data, one Gaussian will dominate the others, causing their weights to approach zero. When the weight of a component falls below a threshold, taken to be less than one observation in our approach, the component is removed from consideration and the algorithm continues with the remaining components until it converges to a solution.

A distributed implementation of the standard variational Bayesian algorithm is obtained as follows. Assume that all nodes have the current hyperparameter estimate $\Gamma^t$. The estimated hyperparameter vector at the next iteration, $\Gamma^{t+1}$, can be computed by performing two message-passing cycles through the nodes. The first message-passing operation involves the transmission of the (partial) sufficient statistics from one node to another. The message passing follows a prescribed sequence through the nodes. For example, assume that the messages are passed from node to node in the order $1, \ldots, M$. To begin, initialize the sufficient statistics $s^t = \{q^t_j, a^t_j, b^t_j\}$ to zero. Each node computes its local updates for the sufficient statistics according to (20)–(22). Note that the local summary quantities $s^t_m = \{q^t_{m,j}, a^t_{m,j}, b^t_{m,j}\}$ can be calculated locally, given the current estimated hyperparameter set $\Gamma^t$ and the local data $Y_m = \{y_{m,i}\}$. In the forward path, in succession from $1, \ldots, M$, node m increments $s^t = \{q^t_j, a^t_j, b^t_j\}$ according to

$$q^t_j = q^t_j + q^t_{m,j} \qquad (26)$$

$$a^t_j = a^t_j + a^t_{m,j} \qquad (27)$$

$$b^t_j = b^t_j + b^t_{m,j} \qquad (28)$$

and passes $s^t = \{q^t_j, a^t_j, b^t_j\}$ to node m+1. This process takes advantage of the fact that computing the global summary quantities can be divided into M separate computations followed by accumulation. At the last node, M, the global summary quantities for the M-step are in hand. The global summary quantities vector $s^t$ is then passed back in the reverse order, from $M, \ldots, 1$. All M nodes now have the sufficient statistics needed to estimate $\Gamma^{t+1}$. The mean values of the parameters $\mu^t_j$, $T^t_j$ and $\pi^t_j$ at each node are then obtained through

$$E[\mu^t_j] = m^t_j, \qquad E[T^t_j] = u^t_j (S^t_j)^{-1}, \qquad \pi^t_j = \frac{q^t_j}{N}$$

where $N = \sum_{m=1}^{M} N_m$. This procedure is repeated until convergence is reached. Then the components whose weights are approximately zero will be removed and the
algorithm continues with the remaining components until it converges to a solution. Note that, using the vector of sufficient statistics, each node can compute all of the required parameters based on the variational approximation method. Moreover, since only this sufficient statistics vector is transmitted, the communication requirement of the proposed algorithm is quite modest. In other words, each iteration of this algorithm requires the transmission of 2(M-1) messages of dimension dim(s^t). A description of the proposed algorithm is summarized below; a code sketch of one iteration follows the list.

(1) Initialize the algorithm with a number of components larger than one would expect to find.
(2) Initialize the sufficient statistics vector $s^t$ to zero.
(3) In the forward path, do the following in succession from node 1 to node M:
(a) Node m computes its local updates for $s^t_m$ according to (20)–(22).
(b) Node m receives $s^t$ from node m-1 and updates $s^t$ according to (26)–(28).
(c) Node m passes the updated $s^t$ to node m+1.
(4) In the backward path, do the following in succession from node M to node 1:
(a) Node m computes the hyperparameter vector $\Gamma^t$ using the global summary quantities obtained in the forward path.
(b) Node m estimates the mean values of the parameters $\mu^t_j$, $T^t_j$ and $\pi^t_j$.
(5) If convergence is reached, go to step 6; otherwise, go to step 3.
(6) If there are components with zero weight, remove those components and go to step 3; otherwise, stop.
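A minimal sketch of one iteration of this two-pass scheme is given below. For brevity, the responsibilities are computed with plug-in expected parameters instead of the exact expressions (10)–(14), the function names and data containers are illustrative rather than taken from the paper, and the forward accumulation of (26)–(28) is simulated by a simple loop over the nodes' data sets.

```python
import numpy as np
from scipy.stats import multivariate_normal

def local_stats(y, weights, means, covs):
    """Local summary quantities of Eqs. (20)-(22) for one node's data y (N_m x d).
    Responsibilities use plug-in expected parameters, a simplification of (10)-(14)."""
    dens = np.column_stack([w * multivariate_normal.pdf(y, m, c)
                            for w, m, c in zip(weights, means, covs)])
    resp = dens / dens.sum(axis=1, keepdims=True)      # q_{m,i,j}
    q = resp.sum(axis=0)                               # unnormalized component counts
    a = resp.T @ y                                     # sum_i q_{m,i,j} y_{m,i}
    b = np.einsum('ij,ik,il->jkl', resp, y, y)         # sum_i q_{m,i,j} y_{m,i} y_{m,i}^T
    return q, a, b

def two_pass_iteration(node_data, weights, means, covs):
    """Forward path: accumulate s^t node by node as in (26)-(28).
    Backward path: the totals (q, a, b) are simply handed back to every node."""
    J, d = len(weights), node_data[0].shape[1]
    q, a, b = np.zeros(J), np.zeros((J, d)), np.zeros((J, d, d))
    for y in node_data:                                # nodes 1, ..., M in sequence
        dq, da, db = local_stats(y, weights, means, covs)
        q, a, b = q + dq, a + da, b + db
    return q, a, b                                     # used by all nodes in (15)-(19)
```

Each node would then plug the returned totals into the compact updates (15)–(19) to obtain the new hyperparameters and, from them, the expected parameter values.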
4. A distributed variational Bayesian algorithm

This section proposes a fully distributed variational Bayesian algorithm (DVBA) that eliminates the need for the forward and backward message-passing processes in the implementation of the standard variational Bayesian approximation discussed above. The DVBA cycles through the nodes of the network and implements the variational Bayesian estimation incrementally at each node, using only the local data at each node and the summary statistics passed from the previous node in the cycle. Similar to the standard VB algorithm [24], DVBA is guaranteed to converge to a local maximum, as described in the next section. Moreover, in practice, because of its incremental form, DVBA often converges much more rapidly than the standard VB algorithm.

The proposed DVBA performs as follows. Initialize $\{\alpha^0_{m,j}\}$, $\{m^0_{m,j}\}$, $\{\beta^0_{m,j}\}$, $\{u^0_{m,j}\}$ and $\{S^0_{m,j}\}$ at some chosen values. Assume that the algorithm proceeds in a cyclic manner. The following processing and communication are carried out at each node in sequence. At iteration t+1, node m receives $q^t_j$, $a^t_j$ and $b^t_j$ from the immediately previous node. The node then computes the hyperparameters using Eqs. (15)–(19). The mean values of the means $\mu_j$ and the precisions $T_j$ at this node are then obtained through

$$E[\mu^t_j] = m^t_j, \qquad E[T^t_j] = u^t_j (S^t_j)^{-1}$$

Then $q^{t+1}_{m,i,j}$ is computed using Eq. (10), and the local values of the sufficient statistics vector are calculated according to (20)–(22). Finally, node m updates its mixing probabilities and the sufficient statistics vectors $q^t_j$, $a^t_j$ and $b^t_j$ using

$$\pi^{t+1}_{m,j} = \frac{q^t_{m,j}}{N_m} \qquad (29)$$

$$q^{t+1}_j = q^t_j + q^{t+1}_{m,j} - q^t_{m,j} \qquad (30)$$

$$a^{t+1}_j = a^t_j + a^{t+1}_{m,j} - a^t_{m,j} \qquad (31)$$

$$b^{t+1}_j = b^t_j + b^{t+1}_{m,j} - b^t_{m,j} \qquad (32)$$

The updated values $\{q^{t+1}_j, a^{t+1}_j, b^{t+1}_j\}$ are then transmitted to the next node and the above process is repeated. This procedure is repeated until convergence is reached. Then the components whose weights are approximately zero are removed and the algorithm continues with the remaining components until it converges to a solution.

In the proposed DVBA, the amount of variation of the log-likelihood function is used as a stopping criterion. If this value is less than a certain threshold, the algorithm stops. The value of the total log-likelihood is calculated as

$$L(\theta^t) = \sum_{m=1}^{M}\sum_{i=1}^{N_m} \log \sum_{j=1}^{J} \pi^t_{m,j}\, \mathcal{N}\big(y_{m,i} \,|\, \mu^t_{m,j}, T^t_{m,j}\big)$$

in which $\mu^t_{m,j}$ and $T^t_{m,j}$ are the estimated mean and precision values at node m and iteration t. Here, after updating the parameters using the data of node m, the value of the local log-likelihood function corresponding to this node is calculated using

$$L_m(\theta^t) = \sum_{i=1}^{N_m} \log \sum_{j=1}^{J} \pi^t_{m,j}\, \mathcal{N}\big(y_{m,i} \,|\, \mu^t_{m,j}, T^t_{m,j}\big)$$

The DVBA algorithm stops whenever the difference $L_m(\theta^{t+1}) - L_m(\theta^t)$ becomes less than a convergence threshold $\varepsilon$. Instead of likelihood variations, parameter variations can also be used as a stopping criterion. Note that this recursive criterion requires that the log-likelihood value of node m at the previous iteration be available at the current iteration.

Assume that the nodes are distributed over a square area on a uniform grid. Denoting by $N_b$ the number of bytes communicated between two nodes per time step, the amount of communication in bytes for the centralized method, in which all nodes send their data to the center of the network, becomes $\sqrt{M}\,(1 + 2 + \cdots + \sqrt{M}/2)\,N_b = O(M^{3/2})$. The worst case for this method occurs when the centralized unit is not at the center of the network but at the edge of the network area; the communication in bytes for such a case is $(M-1 + M-2 + \cdots + 1)\,N_b = O(M^2)$. Once the centralized unit receives all the data, it can run the standard VB algorithm.
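The scaling comparison above can be reproduced with a few lines; the sketch below simply evaluates the three byte counts for a square grid, with $N_b$ and the number of DVBA cycles T chosen arbitrarily for illustration.

```python
import numpy as np

def comm_bytes(M, Nb=1, T=20):
    """Bytes moved: centralized sink at the grid center, worst-case sink at a corner,
    and DVBA with T cycles (Nb and T are illustrative constants)."""
    half_diameter = int(np.sqrt(M)) // 2
    centralized = np.sqrt(M) * sum(range(1, half_diameter + 1)) * Nb   # O(M^{3/2})
    worst_case = sum(range(1, M)) * Nb                                 # O(M^2)
    dvba = M * Nb * T                                                  # O(M)
    return centralized, worst_case, dvba

for M in (16, 64, 256, 1024):
    print(M, comm_bytes(M))
```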
The proposed DVBA executes the communication and computations iteratively. The communication cost is related to the number of loops, i.e., to the accuracy of the estimated results. Denoting by T the number of loops, the communication in bytes for DVBA is $M N_b T = O(M)$. Therefore, unlike the centralized method, the proposed DVBA is scalable. A description of the proposed algorithm is summarized in the following; a sketch of the per-node update appears after the list.

(1) Initialize the algorithm with a number of components larger than one would expect to find.
(2) Initialize the sufficient statistics $s^t$ to zero.
(3) Do the following in succession from node 1 to node M:
(a) Node m receives $s^t$ from node m-1 and computes the hyperparameter vector $\Gamma^t$ according to (15)–(19).
(b) The mean values of the means $\mu_j$ and the precisions $T_j$ at this node are calculated.
(c) Node m computes its local updates for $s^t_m$ according to (20)–(22).
(d) Node m updates its mixing probabilities and the sufficient statistics vector $s^t$ according to (29)–(32).
(e) Node m passes the updated $s^t$ to node m+1.
(4) If convergence is reached, go to step 5; otherwise, go to step 3.
(5) If there are components with zero weight, remove those components and go to step 3; otherwise, stop.
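The per-node processing of steps (3a)–(3d) can be sketched as follows. As in the earlier sketch, the responsibilities are computed with plug-in expected parameters rather than the exact expressions (10)–(14), the prior container and function names are hypothetical, and the scalar priors shared by all components are a simplification made here for brevity.

```python
import numpy as np
from scipy.stats import multivariate_normal

def hyperparams(q, a, b, prior):
    """Compact updates of Eqs. (15)-(19); prior = (alpha0, beta0, m0, u0, S0),
    with alpha0, beta0, u0 scalars, m0 of shape (J, d) and S0 of shape (J, d, d)."""
    alpha0, beta0, m0, u0, S0 = prior
    alpha, beta, u = alpha0 + q, beta0 + q, u0 + q
    m = (beta0 * m0 + a) / beta[:, None]
    S = (S0 + b + beta0 * np.einsum('jk,jl->jkl', m0, m0)
         - beta[:, None, None] * np.einsum('jk,jl->jkl', m, m))
    return alpha, beta, m, u, S

def dvba_node_step(y_m, old_local, stats, prior):
    """One DVBA visit to node m: recompute the hyperparameters from the received
    statistics, recompute the node's local statistics, and swap them in as in
    Eqs. (29)-(32) before forwarding to node m+1."""
    q, a, b = stats
    alpha, beta, m, u, S = hyperparams(q, a, b, prior)
    covs = S / u[:, None, None]                   # plug-in E[T_j]^{-1} = S_j / u_j
    weights = alpha / alpha.sum()
    dens = np.column_stack([weights[j] * multivariate_normal.pdf(y_m, m[j], covs[j])
                            for j in range(len(weights))])
    resp = dens / dens.sum(axis=1, keepdims=True)
    new_local = (resp.sum(axis=0), resp.T @ y_m,
                 np.einsum('ij,ik,il->jkl', resp, y_m, y_m))
    stats = tuple(g - o + n for g, o, n in zip(stats, old_local, new_local))
    return new_local, stats                       # stats is passed on to the next node
```

A full implementation would also update the node's mixing probabilities as in (29), prune components whose weights approach zero, and monitor the local log-likelihood $L_m$ for the stopping test.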
5. Convergence analysis

As mentioned before, in the variational approximation the variational free energy $\mathcal{F}(\hat{f}_Z, \hat{f}_{\theta})$ constructs a lower bound on the log-evidence value L. At each iteration of the variational Bayesian algorithm, the posterior distributions $\hat{f}_Z$ and $\hat{f}_{\theta}$ are chosen such that the value of $\mathcal{F}$ is optimized. Note that $\hat{f}_Z$ and $\hat{f}_{\theta}$ denote the densities of the missing variables and of the parameter vector $\theta$, respectively. The negative free energy may be written as

$$\mathcal{F}(\hat{f}_{\theta}, \hat{f}_Z) = \int \sum_{z} \hat{f}_{\theta}(\theta)\,\hat{f}_Z(z)\, \log \frac{f(y,z,\theta)}{\hat{f}_{\theta}(\theta)\,\hat{f}_Z(z)}\, d\theta = \log f(y) - \mathrm{KL}\big(\hat{f}_{\theta}\hat{f}_Z \,\big\|\, f_{\theta,Z|Y}\big) = L - \mathrm{KL}\big(\hat{f}_{\theta}\hat{f}_Z \,\big\|\, f_{\theta,Z|Y}\big)$$

in which $\mathrm{KL}(\cdot\|\cdot)$ denotes the Kullback–Leibler divergence between the approximating density $\hat{f}_{\theta}(\theta)\hat{f}_Z(z)$ and the true joint posterior. Fig. 1 shows the above relation graphically.

Fig. 1. The quantity $\mathcal{F}$ provides a lower bound on the log-likelihood L, with the difference being given by the Kullback–Leibler divergence.

In the proposed DVBA, assuming that the data sets of different nodes are independent, we have $\hat{f}_{\theta} = \prod_{m=1}^{M} \hat{f}_{\theta_m}$ and $\hat{f}_Z = \prod_{m=1}^{M} \hat{f}_{Z_m}$. Therefore, the function $\mathcal{F}$ can be written as a sum of local variational free energy values over all of the network nodes. If $\mathcal{F}_m$ denotes the local variational free energy at node m, $\mathcal{F}$ can be written as

$$\mathcal{F}(\hat{f}_{\theta}, \hat{f}_Z) = \sum_{m=1}^{M} \mathcal{F}_m(\hat{f}_{\theta_m}, \hat{f}_{Z_m})$$

where

$$\mathcal{F}_m(\hat{f}_{\theta_m}, \hat{f}_{Z_m}) = \int \sum_{Z_m} \hat{f}_{\theta_m}(\theta_m)\,\hat{f}_{Z_m}(Z_m)\, \log \frac{f(Y_m, Z_m, \theta_m)}{\hat{f}_{\theta_m}(\theta_m)\,\hat{f}_{Z_m}(Z_m)}\, d\theta_m$$

To prove convergence of the DVBA, it should be shown that the value of $\mathcal{F}$ is increased at each node until it reaches a maximum value; in other words, it should be shown that $\mathcal{F}$ is non-decreasing in the DVBA. In this case, assuming that $\mathcal{F}$ has a local maximum and that the initial values of the parameters are sufficiently near the optimal values, the DVBA will eventually converge to its fixed point.

In the proposed DVBA, at each node m, $\hat{f}_{Z_m}(Z_m)$ is first calculated so that the value of $\mathcal{F}_m$ is maximized with respect to $\hat{f}_{Z_m}$, and then $\hat{f}_{\theta_m}(\theta_m)$ is calculated to maximize $\mathcal{F}_m$ with respect to $\hat{f}_{\theta_m}$. In fact, by updating the local sufficient statistics $s_m = \{q_{m,j}, a_{m,j}, b_{m,j}\}$ using Eqs. (20)–(22), the value of $\mathcal{F}_m$ is maximized with respect to $\hat{f}_{Z_m}$, and it is also maximized with respect to $\hat{f}_{\theta_m}$ by calculating the hyperparameters using Eqs. (5)–(9). Therefore, the DVBA cycles through the nodes and increases the value of $\mathcal{F}_m$ at each node m, while the values of $\mathcal{F}_n$, $n \ne m$, do not change. Hence, the total value of $\mathcal{F}$ also increases. As a result, $\mathcal{F}$ is non-decreasing in the DVBA and, since $\mathcal{F}$ is bounded above by L, the algorithm finally converges to the optimal parameter values.

6. Simulation results

This section is devoted to showing the performance of the proposed DVBA algorithm. Two different cases are considered, as given below.

6.1. Case 1: Environmental modeling
Here, we use a network of 100 nodes, each containing 100 data points, to evaluate the performance of DVBA in environmental modeling. It is assumed that each node in the network senses an environment that can be described as a mixture of some elementary conditions. The measurements are thus statistically modeled with a mixture of Gaussians; each Gaussian component corresponds to one of the elementary conditions.
We first consider a one-dimensional data set simulated from the 4-component Gaussian mixture

$$0.288\,\mathcal{N}(0, 0.04) + 0.26\,\mathcal{N}(1.5, 0.25) + 0.171\,\mathcal{N}(2.2, 11.56) + 0.281\,\mathcal{N}(3.3, 0.25)$$

We applied the proposed DVBA to this data set, initializing the program with 7 components (a value above the number of components we knew to be present), and the method automatically recovered the correct number of components for the model. In our approach, the prior means $m^0_j$ were all set to zero. The parameter $\beta^0$ was chosen to be 0.05, to give a broad prior over the mean. The initial values for the degrees of freedom $u^0_j$ and the scale matrix $S^0_j$ were taken to be 2 and 0, respectively. The mixing weights were given a Dirichlet prior with the initial $\alpha$'s set to 0. The DVBA algorithm successfully estimated the correct number of components. The estimated parameters obtained through DVBA are presented in Table 2. The average values of the estimated means and variances, together with the standard deviations of these parameters over the nodes, are given in this table. As can be seen, very good estimates of the true values have been obtained. The small standard deviation values show that the estimated means and variances at all 100 nodes are almost the same. The true density of the simulated data set and the density fitted by the DVBA at node 1 are also illustrated in Fig. 2. This figure shows that the estimated probability density function closely approximates the true one. The convergence criterion is the local log-likelihood variation at each node: when the likelihood variation becomes less than a convergence threshold $\varepsilon$, the components with negligible weights are removed and the algorithm continues with the remaining components. This procedure is repeated until the DVBA converges to a solution. Here, the convergence threshold is $\varepsilon = 10^{-8}$. Other simulations have shown similar results.
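The data generation and prior settings just described can be summarized in a short sketch; the random seed and the way the data are split across nodes are assumptions of this sketch, while the mixture and the prior values are the ones stated above.

```python
import numpy as np

rng = np.random.default_rng(1)

# Four-component 1-D mixture of Case 1 (weights, means, variances as given above).
weights = np.array([0.288, 0.26, 0.171, 0.281])
means = np.array([0.0, 1.5, 2.2, 3.3])
variances = np.array([0.04, 0.25, 11.56, 0.25])

M, N_m = 100, 100
def node_data():
    z = rng.choice(4, size=N_m, p=weights)
    return rng.normal(means[z], np.sqrt(variances[z]))
Y = [node_data() for _ in range(M)]

# Prior settings reported in the text: 7 initial components, zero prior means,
# beta0 = 0.05 (broad prior over the means), u0 = 2, S0 = 0, Dirichlet alpha0 = 0.
J0 = 7
m0 = np.zeros(J0)
beta0, u0, S0, alpha0 = 0.05, 2.0, 0.0, 0.0
```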
A 2D data set is used for the next simulation. A sensor network with 100 nodes (M = 100) is considered again, in which each node has 100 data observations ($N_m$ = 100). The observations are generated from the three Gaussian components (J = 3) shown in Fig. 3. Each of these components represents a 2D environmental data cluster. Eighty percent of the observations of 40 nodes come from the first Gaussian component, and the other 20% of the observations of these nodes come evenly from the other two Gaussian components. In 30 nodes, 80% of the data points come from the second component, and in the other 30 nodes, 80% of the data points come from the third component. The remaining observations of these nodes come evenly from the other two Gaussian components. The nodes are chosen randomly when assigning data points. Here again, we initialized the DVBA algorithm with seven components, and it automatically recovered the correct number of components. The second elements of the estimated mean vectors at the 100 nodes are shown in Fig. 4. It can be seen that the estimated mean values at all nodes are very close to their true values (the true values are 2, 0 and -2).
Fig. 2. Distribution of the simulated data set from a mixture of four normal components.
Fig. 3. Data distribution.
Table 2
Fitted means and variances using the DVBA.

Component   Average of means   Std. of means (x 10^-7)   Average of variances   Std. of variances (x 10^-6)
1           0.0019             0.0163                    0.0402                 0.0006
2           1.5072             0.0385                    0.2556                 0.0104
3           2.2480             0.8429                    11.2847                0.6535
4           3.3040             0.0557                    0.2565                 0.0236
The true and estimated parameters of the components are presented in Tables 3 and 4, respectively. As can be seen, good estimates of the true values have been obtained. The values reported in these tables are the mean values of the estimated parameters over all nodes of the network. The standard deviations of the estimated parameters across the nodes are of the order of $10^{-5}$; these small values imply that the estimates obtained at each node are almost the same. Here, the convergence threshold is $\varepsilon = 10^{-6}$. Table 4 also presents the parameter values estimated using the standard VB algorithm. As can be seen, these values are essentially the same as those obtained through the DVBA. Therefore, the DVBA has performed as well as the centralized VB while operating in a distributed structure.
The mean values of the Gaussian densities estimated by the DVBA and the Gaussian densities estimated by the centralized VB algorithm are shown in Fig. 5. As can be seen, the densities estimated by these two algorithms are almost the same. The true density of the simulated data is also shown in this figure. Another simulation is considered in which the covariance matrices of the components are non-diagonal. A 2D data set generated from a 3-component Gaussian mixture model is used for this simulation as well. As in the previous case, a sensor network with 100 nodes, each containing 100 observations, is considered. In this simulation, the observations of the nodes come evenly from the three components. The algorithm was initialized with eight components, and the correct number of components was recovered automatically. The true and fitted mean and covariance matrices obtained through DVBA are presented in Table 5. It is easily observed that the DVBA algorithm is capable of properly estimating the mixture parameters in this case too.
Fig. 4. Three estimated mean values using DVBA in the network with 100 nodes.
Table 3
True mean and covariance matrices for the 2D data set.

Component   Mean vector   Covariance matrix
1           [0, 2]        [2 0; 0 0.2]
2           [0, 0]        [2 0; 0 0.2]
3           [0, -2]       [2 0; 0 0.2]

Fig. 5. Gaussian densities estimated by the DVBA algorithm (solid ellipses), densities estimated by the standard variational Bayesian algorithm (dotted ellipses), and the true density of the simulated data (dashed ellipses).
Table 4
Fitted mean and covariance matrices using the DVBA and standard VB algorithms.

Component 1:
  DVBA: mean [0.0003, 1.9893], covariance [2.0634 0.0111; 0.0111 0.2077]
  Standard VB: mean [0.0004, 1.9891], covariance [2.0912 0.0112; 0.0112 0.2076]
Component 2:
  DVBA: mean [0.0293, 0.0154], covariance [2.0533 0.0288; 0.0288 0.1625]
  Standard VB: mean [0.0297, 0.0169], covariance [2.0582 0.0292; 0.0292 0.1836]
Component 3:
  DVBA: mean [0.0521, -1.9938], covariance [1.9481 0.0192; 0.0192 0.1975]
  Standard VB: mean [0.0553, -1.9944], covariance [1.9480 0.0201; 0.0201 0.1970]
Table 5
True and fitted mean and covariance matrices using the DVBA.

Component 1:
  True: mean [0, 2], covariance [2 0.3; 0.3 0.2]
  Estimated: mean [0.0386, 1.9879], covariance [2.0710 0.2975; 0.2975 0.1957]
Component 2:
  True: mean [0, 0], covariance [2 0.3; 0.3 0.2]
  Estimated: mean [0.0637, -0.0243], covariance [2.0220 0.3011; 0.3011 0.1944]
Component 3:
  True: mean [0, -2], covariance [2 0.3; 0.3 0.2]
  Estimated: mean [0.0070, -2.0108], covariance [2.0914 0.3401; 0.3401 0.2116]
6.2. Case 2: Distributed target classification

With the increasing use of camera surveillance in public areas, the need for automated surveillance solutions is rising. A particular problem is camera surveillance in wide areas, such as airports and shopping centers. Such areas typically cannot be fully observed by a single camera, and surveillance of such places relies on a network of sparsely distributed cameras. In this particular setting, the problem of tracking persons across all cameras is difficult. Someone is first observed by one camera, then he or she is out of the sight of any camera, and later on he or she reappears at another camera. We would like to know whether the two observed persons are in fact the same individual. In this part, we employ DVBA in a distributed system for target recognition. In this system, each camera is a stand-alone tracking unit, which stores its own observations and exchanges only limited data with other cameras. The local observations, in combination with the exchanged data, allow each camera to learn its own local model. Similar to other approaches [25], we use appearance cues, such as average color or length, to find the correspondence between observations and persons. Since the same person will appear differently each time he or she is observed, we model the observations as a stochastic process. We assume that the observations are samples drawn from a Gaussian distribution with person-specific parameters, which are constant over time. In a system where J persons are monitored, the observations of all persons are generated by a Gaussian mixture model (GMM) with J components. Each component of the mixture model corresponds to a specific person. Here, we consider learning the parameters of the GMM with the proposed DVBA. Given the learned GMM, we assign the most likely person to each observation.

We have performed a series of tests with the presented DVBA. The performance of the algorithm is compared with that of a standard VB implementation. We evaluate the algorithms on artificially generated data. A set of 1000 observations from five persons, distributed over several cameras, is used as a standard set. Every observation $y_i$ consists of a 3-dimensional appearance vector. The observations are randomly distributed over the cameras according to a uniform pdf. The difficulty of the generated data is measured by the c-separation and eccentricity values [26].

Definition 1. Two Gaussians $\mathcal{N}(\mu_1, \Sigma_1)$ and $\mathcal{N}(\mu_2, \Sigma_2)$ in $\mathbb{R}^n$ are considered c-separated if

$$\|\mu_1 - \mu_2\| \ge c\,\sqrt{n\,\max\big(\lambda_{\max}(\Sigma_1), \lambda_{\max}(\Sigma_2)\big)}$$

where $\lambda_{\max}(\Sigma)$ denotes the largest eigenvalue of $\Sigma$.

Definition 2. For a positive definite matrix $\Sigma$, let $\lambda_{\max}(\Sigma)$ and $\lambda_{\min}(\Sigma)$ refer to its largest and smallest eigenvalues, respectively, and denote by $e(\Sigma)$ the eccentricity of the matrix, that is, $\sqrt{\lambda_{\max}(\Sigma)/\lambda_{\min}(\Sigma)}$.

An increasingly difficult recognition problem is indicated by increasing eccentricity or decreasing c-separation values. The dataset has a c-separation of 1 and an eccentricity of 3. Several datasets are generated to investigate the performance of the algorithm under variations of the number of cameras and of the distribution of the observations over the cameras.
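Both definitions translate directly into code. The following minimal sketch (the function names are ours, not from the paper) computes the largest c for which two Gaussians are c-separated, and the eccentricity of a covariance matrix.

```python
import numpy as np

def c_separation(mu1, S1, mu2, S2):
    """Largest c such that the two Gaussians are c-separated (Definition 1)."""
    n = len(mu1)
    lam = max(np.linalg.eigvalsh(S1).max(), np.linalg.eigvalsh(S2).max())
    return np.linalg.norm(np.asarray(mu1) - np.asarray(mu2)) / np.sqrt(n * lam)

def eccentricity(S):
    """Eccentricity of a positive definite matrix (Definition 2)."""
    lam = np.linalg.eigvalsh(S)
    return np.sqrt(lam.max() / lam.min())
```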
Evaluation: The evaluation criteria should reflect two aspects of proper clustering: (i) all observations within a single reconstructed cluster belong to a single person, and (ii) all observations of a single person are grouped together in a single reconstructed cluster. These criteria are analogous to the precision and recall criteria often used in information retrieval settings. In order to evaluate both systems with one number, we use the F1-measure defined in (35). Because the considered clustering problem is unsupervised, the true and proposed clusters are arbitrarily ordered. Therefore, we define the precision (33) and recall (34) of a proposed cluster over its best match with a real cluster. Importantly, precision and recall have to be considered jointly, because it is trivial to obtain a recall of 1 by clustering all observations into one cluster; however, this results in a very low precision. The F1-measure (35) is the harmonic mean of precision and recall and penalizes such cases:

$$\mathrm{Pr} = \frac{1}{J}\sum_{s=1}^{J} \frac{\max_i |\hat{C}_s \cap C_i|}{|\hat{C}_s|} \qquad (33)$$

$$\mathrm{Rc} = \frac{1}{J}\sum_{i=1}^{J} \frac{\max_s |\hat{C}_s \cap C_i|}{|C_i|} \qquad (34)$$

$$F_1 = \frac{2\,\mathrm{Pr}\,\mathrm{Rc}}{\mathrm{Pr} + \mathrm{Rc}} \qquad (35)$$
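A minimal sketch of these three measures for label vectors is given below. The matching of each reconstructed cluster to its best-overlapping true cluster follows Eqs. (33)–(34); the guard against empty clusters is an implementation detail of this sketch, not part of the definitions.

```python
import numpy as np

def precision_recall_f1(true_labels, est_labels, J):
    """Cluster precision, recall and F1 of Eqs. (33)-(35), with labels in 0..J-1."""
    true_labels, est_labels = np.asarray(true_labels), np.asarray(est_labels)
    # overlap[s, i] = |C_hat_s intersect C_i|
    overlap = np.array([[np.sum((est_labels == s) & (true_labels == i)) for i in range(J)]
                        for s in range(J)])
    est_sizes = np.maximum(overlap.sum(axis=1), 1)      # |C_hat_s| (guard empty clusters)
    true_sizes = np.maximum(overlap.sum(axis=0), 1)     # |C_i|
    pr = np.mean(overlap.max(axis=1) / est_sizes)       # Eq. (33)
    rc = np.mean(overlap.max(axis=0) / true_sizes)      # Eq. (34)
    return pr, rc, 2 * pr * rc / (pr + rc)              # Eq. (35)
```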
In Fig. 6, the results of DVBA are shown for different numbers of cameras. The observations are distributed over the cameras according to a uniform distribution. The performance is compared with the standard VB. The number of observations was fixed at 1000. The figure shows that the performance of DVBA is independent of the number of cameras; as can be seen, the performance of DVBA and that of the standard VB algorithm are the same. In Fig. 7, the number of iterations of both algorithms is illustrated for different numbers of cameras. As shown in this figure, the convergence rate of the DVBA is much faster than that of the standard VB, which is an important advantage. In another simulation, uniform noise was added to the observed data to study the performance of both the DVBA and the standard VB algorithm under a realistic noisy model. Fig. 8 shows the value of the F1-measure under this more realistic noisy situation. As can be seen, the performance of both algorithms was degraded slightly in the case of noisy observations.
Fig. 6. The performance of DVBA and standard VB algorithms with different numbers of cameras.

Fig. 7. The number of DVBA and standard VB iterations with different numbers of cameras.
7. Conclusion

In this paper, a distributed variational Bayesian algorithm (DVBA) was developed for density estimation and clustering in a sensor network. The variational approach allows the simultaneous estimation of the component parameters and the model complexity. DVBA is a distributed algorithm that performs local computations on the sensor data at each node and passes a small set of sufficient statistics from node to node in the iteration process. Therefore, the communication requirements of DVBA are quite modest. Meanwhile, the convergence rate of the DVBA is superior to that of the standard VB algorithm due to its incremental nature. We have shown that DVBA converges to a stationary point of the log-likelihood function, usually a local maximum. The proposed DVBA is a general distributed data mining algorithm for density estimation and clustering of data distributed over the nodes of a network. Natural extensions of this work would involve exploring whether other distributed data mining problems can be formulated using the variational optimization framework, and whether the variational method leads to communication-efficient techniques for solving these problems. These issues are currently under investigation and will be reported in the near future.
Fig. 8. The performance of DVBA and standard VB algorithms under noisy observations.

References
[1] I. Akyildiz, W. Su, Y. Sankarasubramniam, A survey on sensor networks, IEEE Communications Magazine 40 (8) (2002) 102–114.
[2] D. Estrin, R. Govindan, J. Heidmann, S. Kumar, Next century challenges: scalable coordination in sensor networks, in: Proceedings of the ACM/IEEE International Conference on Mobile Computing and Networking, Seattle, WA, 1999, pp. 263–270.
[3] R. Kamimura, Cooperative information maximization with Gaussian activation functions for self-organizing maps, IEEE Transactions on Neural Networks 17 (4) (2006) 909–918.
[4] Y. Sheng, X. Hu, P. Ramanathan, Distributed particle filter with GMM approximation for multiple targets localization and tracking in wireless sensor networks, in: Proceedings of the Fourth International Symposium on Information Processing in Sensor Networks, Los Angeles, CA, 2005, pp. 181–188.
[5] S. Datta, K. Bhaduri, C. Giannella, R. Wolff, H. Kargupta, Distributed data mining in peer-to-peer networks, IEEE Internet Computing 10 (2006) 18–26.
[6] R. Wolff, A. Schuster, Association rule mining in peer-to-peer systems, IEEE Transactions on Systems, Man and Cybernetics, Part B 34 (2004) 2426–2438.
[7] C. Giannella, H. Dutta, S. Mukherjee, H. Kargupta, Efficient kernel density estimation over distributed data, in: Ninth International Workshop on High Performance and Distributed Mining, SIAM International Conference on Data Mining, 2006.
[8] X. Nguyen, M.I. Jordan, B. Sinopoli, A kernel-based learning approach to ad hoc sensor network localization, ACM Transactions on Sensor Networks 1 (1) (2005) 134–152.
[9] S. Simic, A learning theory approach to sensor networks, IEEE Pervasive Computing 2 (4) (2003) 44–49.
[10] M. Paskin, C. Guestrin, J. McFadden, A robust architecture for inference in distributed sensor networks, in: International Conference on Information Processing in Sensor Networks (IPSN), 2005.
[11] M.A. Paskin, C.E. Guestrin, Robust probabilistic inference in distributed systems, Uncertainty in Artificial Intelligence 20 (2004).
[12] A. Dempster, D. Rubin, Maximum likelihood estimation from incomplete data via the EM algorithm, Journal of the Royal Statistical Society 39 (1977) 1–38.
[13] R. Neal, G. Hinton, A view of the EM algorithm that justifies incremental, sparse, and other variants, in: M. Jordan (Ed.), Learning in Graphical Models, MIT Press, Cambridge, MA, 1999, pp. 355–368.
[14] R.D. Nowak, Distributed EM algorithms for density estimation and clustering in sensor networks, IEEE Transactions on Signal Processing 51 (2003) 2245–2253.
[15] W. Kowalczyk, N. Vlassis, Newscast EM, in: Advances in Neural Information Processing Systems, vol. 17, MIT Press, Cambridge, MA, 2005.
[16] D. Gu, Distributed EM algorithm for Gaussian mixtures in sensor networks, IEEE Transactions on Neural Networks 19 (7) (2008) 1154–1166.
[17] H. Attias, Inferring parameters and structure of latent variable models by variational Bayes, in: Proceedings of the Conference on Uncertainty in Artificial Intelligence, 1999.
[18] H. Attias, A variational Bayesian framework for graphical models, in: T. Leen et al. (Eds.), Advances in Neural Information Processing Systems, vol. 12, MIT Press, Cambridge, MA, 2000.
[19] C. Constantinopoulos, A. Likas, Unsupervised learning of Gaussian mixtures based on variational component splitting, IEEE Transactions on Neural Networks 18 (3) (2007) 745–755.
[20] A. Corduneanu, C.M. Bishop, Variational Bayesian model selection for mixture distributions, in: T. Jaakkola, T. Richardson (Eds.), Artificial Intelligence and Statistics, Morgan Kaufmann, Los Altos, CA, 2001, pp. 27–34.
[21] C.A. McGrory, D.M. Titterington, Variational approximations in Bayesian model selection for finite mixture distributions, Computational Statistics & Data Analysis 51 (2007) 5352–5367.
[22] W.D. Penny, S.J. Roberts, Variational Bayes for 1-dimensional mixture models, Technical Report PARG-2000-01, Department of Engineering Science, Oxford University, 2000.
[23] S. Waterhouse, D. MacKay, T. Robinson, Bayesian methods for mixtures of experts, in: D.S. Touretzky et al. (Eds.), Advances in Neural Information Processing Systems, vol. 8, MIT Press, Cambridge, MA, 1996.
[24] B. Wang, D.M. Titterington, Convergence properties of a general algorithm for calculating variational Bayesian estimates for a normal mixture model, Bayesian Analysis 1 (3) (2006) 625–650.
[25] W. Zajdel, Bayesian visual surveillance: from object detection to distributed cameras, Ph.D. Thesis, Universiteit van Amsterdam, 2006.
[26] S. Dasgupta, Learning mixtures of Gaussians, in: Proceedings of the 40th Annual Symposium on Foundations of Computer Science, Washington, DC, USA, 1999, p. 634.