Distributed machine learning in networks by consensus




Neurocomputing 124 (2014) 2–12



Leonidas Georgopoulos (a), Martin Hasler (b)

(a) Avenue de l'Eglise Anglaise 12, CH-1006 Lausanne, Switzerland
(b) School of Computer and Communication Sciences, École Polytechnique Fédérale de Lausanne (EPFL), EPFL IC ISC LANOS, Station 14, CH-1015 Lausanne, Switzerland

Article history: Received 15 September 2011; Received in revised form 28 September 2012; Accepted 3 December 2012; Available online 8 April 2013

Abstract

We propose an algorithm to learn from distributed data on a network of arbitrarily connected machines without exchange of the data-points. Parts of the dataset are processed locally at each machine, and then the consensus communication algorithm is employed to consolidate the results. This iterative two stage process converges as if the entire dataset had been on a single machine. The principal contribution of this paper is the proof of convergence of the distributed learning process in the general case that the learning algorithm is a contraction. Moreover, we derive the distributed update equation of a feed-forward neural network with back-propagation for the purpose of verifying the theoretical results. We employ a toy classification example and a real world binary classification dataset.

Keywords: Distributed machine learning; Parallel machine learning; Gradient descent; Consensus; Peer-to-peer learning; Neural networks

1. Introduction

We consider the case of supervised machine learning, but in a distributed setting where the dataset is partitioned and a part resides at each machine in the communication network. We assume that for some reason we are unwilling to communicate the data between machines, or to gather it centrally for computation. Instead of just using the local data at each machine, we would like to learn from the entire dataset, but without exchange of data-points. This is accomplished by employing the consensus algorithm, which is briefly reviewed in Section 1.2. There are numerous occasions where the entire dataset may not be available. We conceive these as combinations of the following four basic cases. First, the dataset is too large to be handled by a single machine, due to hardware, software implementation, or algorithmic limitations. Second, the data is intrinsically distributed. That is the case when data is generated by a set of machines which acquire it by observation or examination (e.g. a set of sensors for environmental monitoring). Third, the data has to remain private but decisions are better performed globally (e.g. when working with clinical patient data). Finally, the data cannot be collected: it is inaccessible or its access is practically infeasible. This might be due to many reasons, among them communication costs and


failures, energy consumption, and machine downtime. A few of the possible applications where the problem arises are wireless sensor networks, data mining in large datasets, distributed databases, social networks, robotic applications, and inference from confidential or private data.

1.1. Problem layout

For the rest of this paper we assume a dataset, partitioned and distributed over different machines in an arbitrary manner. Our purpose is to learn from the dataset such that any of the machines, when presented with an example from the same generating process, can successfully classify it. Moreover, we wish the classification performed by any machine to be identical and its performance equivalent to the centralised case. This has to be achieved without exchanging any data-points between the machines. Specifically, any machine can communicate with other machines, but not necessarily with every other machine. However, we require that the connection graph has no disconnected components. Obviously, disconnected components cannot be brought to agreement by means of communication algorithms. In our algorithm, the elementary learning process, risk computation and model update, is modified and becomes a two-phase process. The first phase consists of learning with the dataset available locally. This is performed simultaneously at every machine. Therefore, identical learning machines are trained locally but with different data drawn from the same generating process. In the second phase, the parameters of the model are


communicated to initiate the consensus algorithm and estimate the mean of the learned parameters. This iterated two-phase process has roughly the same effect as if a single classifier was trained on the entire dataset. The use of the consensus algorithm is certainly not a prerequisite. The necessary step is to compute the mean of the model updates at each step, and to compute the new update from this mean at each machine. The proof that such a process converges like its non-distributed counterpart, in Section 2.1, is the main contribution of this paper. Such a computation may in general be achieved with other communication protocols as well, from broadcasting to gathering the values at a central hub and then broadcasting the results back to the machines. These approaches have their drawbacks, such as congestion, the need for routing protocols, and administration complexity that grows with the number of nodes on the network and its topology, but it is not in the interest of this study to compare the different approaches in depth. The main advantage of the consensus algorithm, which justifies its use in this study, is that it is so simple that it can be implemented in numerous scenarios; this separates the theoretical analysis from the implementation details without making the work inapplicable. Nonetheless, demonstrating that distributed machine learning can be achieved in even the most simplistic scenarios of ad-hoc communication networks brings additional value to this work, and extends the list of possible applications.

1.2. Consensus algorithm

Assume a network of machines coupled in an arbitrary manner. Let the communication graph be $G(V, E)$, connected but not necessarily fully connected. Suppose every vertex $i$ in the graph has an associated scalar value $x_i \in \mathbb{R}$. The consensus algorithm computes the arithmetic mean of these values at each vertex. This is possible just by local communication between connected vertices. The linear consensus algorithm as presented in [1] consists of a simple vertex-local update equation $x_i(t+1) = \sum_j w_{ij} x_j(t)$, where $w_{ij} \in \mathbb{R}_+$ are coefficients associated with the edges of the graph. Specifically, $w_{ij} \neq 0$ when vertices $i$ and $j$ are connected and $w_{ij} = 0$ otherwise. These coefficients guarantee the coherence of the local estimates. The global update equation is

$x(t+1) = W x(t)$   (1)

where the elements of $W \in \mathbb{R}^{n \times n}$ are the coefficients $w_{ij}$ on the edges. The process is presented in Algorithm 1.

Algorithm 1. Consensus algorithm, $S(x, q)$.
1: Execute the for loop at every $i$-th machine simultaneously
2: for $t = 1$ to $q$ do
3:   $x_i \leftarrow \sum_{j=1}^{n} w_{ij} x_j$
4: end for
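To make the update rule concrete, here is a minimal NumPy sketch of Algorithm 1 on a small ring of machines. The Metropolis-Hastings construction of the weights $w_{ij}$ is our own choice for this example; it is one standard way of meeting the convergence conditions discussed next and is not prescribed by the paper.

```python
import numpy as np

def metropolis_weights(adjacency):
    """Build a symmetric, stochastic weight matrix W from a 0/1 adjacency matrix.
    (Metropolis-Hastings weights; a common choice, assumed here for illustration.)"""
    n = adjacency.shape[0]
    degrees = adjacency.sum(axis=1)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j and adjacency[i, j]:
                W[i, j] = 1.0 / (1.0 + max(degrees[i], degrees[j]))
        W[i, i] = 1.0 - W[i].sum()
    return W

def consensus(x, W, q):
    """Algorithm 1: q synchronous iterations of x <- W x."""
    for _ in range(q):
        x = W @ x
    return x

# Five machines on a ring, each holding one scalar value.
A = np.zeros((5, 5), dtype=int)
for i in range(5):
    A[i, (i + 1) % 5] = A[(i + 1) % 5, i] = 1
W = metropolis_weights(A)
x0 = np.array([1.0, 4.0, 2.0, 8.0, 5.0])
print(consensus(x0, W, q=50))  # every entry approaches mean(x0) = 4.0
```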

The convergence of the algorithm depends solely on the selection of these coefficients. Sufficient conditions for convergence are $W^T = W$, $W\mathbf{1} = \mathbf{1}$, and $\rho(W - \mathbf{1}\mathbf{1}^T/n) < 1$, where $\mathbf{1} = (1, 1, \ldots, 1)^T \in \mathbb{R}^n$ and $\rho(\cdot)$ denotes the spectral radius. This dynamical system converges asymptotically to the arithmetic mean $(1/n)\mathbf{1}\mathbf{1}^T x$. Practically, the number of iterations $q \in \mathbb{Z}_+$ affects the precision of the estimate of the mean and the level of agreement [2]. Advantages of employing the consensus algorithm include, but are not limited to, being inherently distributed and robust, and having no need for routing tables, sub-network wide switching, or packet switching. In the simplest case, what is necessary, but


usually trivial to arrange, is to determine time-slots only between networked neighbours. Moreover, the consensus algorithm can be applied to both digital and analog networks [3], and to LAN and WAN networks, with little effort. It is so simple to employ that it permits the application of this work in even the simplest networks, such as ad-hoc wireless sensor networks, robot swarms, and simple peer-to-peer networks, without interfering with the employed communication protocol stack. Finally, the algorithm needs no central coordination centre, e.g. a switch to broadcast the packets, which distinguishes it among its counterparts; thus it operates in a completely decentralised fashion.

1.3. Definitive consensus algorithm

The main drawback for the application of the consensus algorithm is the large number of communications needed to reach consensus. This can be alleviated if the communication coefficients $w_{ij}$ are switched in a timely manner [4]. This can be achieved with the definitive consensus algorithm, which in fact permits the network to reach consensus in a fixed and finite number of iterations. The coefficients can be obtained by numerically solving the equation

$W_d W_{d-1} \cdots W_1 = \frac{1}{n}\mathbf{1}\mathbf{1}^T$   (2)

where $d$ is the graph diameter, and $W_d, W_{d-1}, \ldots, W_1$ are weight matrices corresponding to $G$. The solutions to this equation are easy to retrieve up to medium sized graphs with a numerical solver [4].
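For intuition, the sketch below merely applies a given finite sequence of weight matrices $W_1, \ldots, W_d$, as the definitive consensus algorithm prescribes; obtaining the matrices themselves requires the numerical solver for Eq. (2) described in [4] and is not shown. The two-machine toy sequence is our own trivial example.

```python
import numpy as np

def definitive_consensus(x, weight_sequence):
    """Apply the pre-computed matrices W_1, ..., W_d in order.
    If their product equals (1/n) 1 1^T, the exact mean is reached after d steps."""
    for W in weight_sequence:
        x = W @ x
    return x

# Toy check with n = 2 fully connected machines (diameter d = 1): the single
# matrix (1/2) 1 1^T trivially satisfies Eq. (2).  For larger graphs the
# sequence W_1, ..., W_d would come from a numerical solver.
n = 2
weight_sequence = [np.full((n, n), 1.0 / n)]
x0 = np.array([3.0, 7.0])
print(definitive_consensus(x0, weight_sequence))  # [5. 5.] after d = 1 step
```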

1.4. Related work

Work related to the problem at hand may be found in the field of distributed optimisation [5]. However, the interests of the researchers there are different, i.e. accuracy, quality of solution, convergence speed, violation of constraints, whereas in machine learning they are generalisation, model selection, bias, and over-fitting. Thus optimisation cannot be considered equivalent to a learning problem. In the case of optimisation, the optimised function is known a priori. This allows for the computation of the Jacobian, the Hessian, and the retrieval of KKT conditions that designate the proper execution of the algorithm and the quality of the obtained solution; especially of the sub-gradient computed with the consensus algorithm, which can be enforced to stay within bounds by strict knowledge of qualitative benchmarks of the update step, such that the result is equivalent to the non-distributed evaluation of the gradient. In contrast, in a machine learning process such facilities are not available, since the model of the data is unknown and one cannot evaluate the quality of a solution. In outline, a central problem of statistical learning theory is whether a family of learning models imposed by a learning algorithm is appropriate to model the data, i.e. to shatter the data [6]. In the distributed setting the data-points locally available are a subset of those globally available. A learning machine that may be able to adequately learn on a given data set may not learn the true underlying model on a subset of the data. The problem that arises in distributed machine learning is whether performing machine learning distributively and combining the local results still permits shattering the data, as in the non-distributed counterpart. This central question lies far from previous work, and especially from approaches in the distributed optimisation literature. We base our theoretical results on the fact that a large number of machine learning algorithms are contractions on sets of learning data. This approach, however, does not provide guarantees



for the quality of the solution, which would certainly be an issue in the optimisation literature. Notable attempts are found in the field of distributed support vector machines [7–9]. A notable study has targeted the modification of the EM algorithm for distributed computation [10]. An interesting study in the field, which bears specific interest for many applications, is [11], which initially brought this field of research to our attention. Nevertheless, most of these studies differ from ours since they are specific to each of those learning algorithms. In contrast, we approach the matter in general, and our theoretical results can be applied to a large class of algorithms, since the theoretical basis depends on the contraction principle, which is central to many learning algorithms. Moreover, unlike other studies, we do not exchange any of the data, and we do not make explicit assumptions about the manner in which the data is partitioned.

2. Theory

In outline, the algorithm for distributed learning has two phases. Each machine hosts an identical local learning algorithm. The first phase consists of the local training, e.g. of a neural network, executed at each machine. The second phase is the execution of the consensus algorithm. There, the learned quantities are exchanged between machines while the data is retained locally. Our purpose is to obtain the model as if the entire dataset was available locally at each machine.

Suppose the dataset $D$ is segregated into $n$ non-overlapping subsets such that $D = \bigcup_{k=1}^{n} {}^k D$. Each subset is ${}^k D = \{{}^k X, {}^k Y\}$, where ${}^k X = \{{}^k x_1, {}^k x_2, \ldots, {}^k x_{m_k}\}$ is the subset of examples, and ${}^k x_i \in \mathbb{R}^s$ is the $i$th example in the $k$th subset, with $s \in \mathbb{N}$. Similarly for the subset of class labels ${}^k Y = \{{}^k y_1, {}^k y_2, \ldots, {}^k y_{m_k}\}$: ${}^k y_i \in \{0, 1\}$ is likewise the label associated with the $i$th example in the $k$th subset. Hence the entire dataset is $D = \{{}^k x_i, {}^k y_i\}$, $k \in \{1, 2, \ldots, n\}$, $i \in \{1, 2, \ldots, m_k\}$, and $m_k$ is such that $m = \sum_{k=1}^{n} m_k$ is the number of examples in $D$. Furthermore, suppose that these subsets are distributed over a network of $n$ machines, described by a graph $G(V, E)$ such that each subset ${}^k D$ is related to one vertex. The problem at hand is:

Problem 1 (Distributed learning). Given graph $G(V, E)$ and a segregated dataset $D$ such that ${}^k D$ is related with the $k$th vertex on the graph, with $k \in \{1, 2, \ldots, n\}$, how can each machine learn a mapping $f$ of examples to class labels in $D$ without communicating any data-point $\{{}^k x_i, {}^k y_i\}$?

Assume some non-distributed iterated learning algorithm $L$ that learns on the dataset by minimising the empirical risk (Eq. (3)). The empirical risk for the entire dataset $D$ is

$R_{emp}(D, f) = \frac{1}{m} \sum_{k=1}^{n} \sum_{i=1}^{m_k} Q({}^k x_i, {}^k y_i, f)$   (3)

where $Q(x, y, f)$, $Q: \mathbb{R}^s \times \{0, 1\} \times \mathcal{F} \to \mathbb{R}$ is a loss function, $f \in \mathcal{F}$ is a classifier $f: X \to Y$ that maps examples to class labels, and $\mathcal{F}$ is the set of admissible classifiers. We assume a binary classification problem for simplicity, but the results are straightforward to extend to multi-class and regression problems.

In general, many non-distributed learning algorithms, let $L$ denote one, can be conceived as a simple iterative two-step process. At the first step the empirical risk is computed. Subsequently, at the second step, an update of the mapping $f$ is determined by some deterministic process $A$. The latter can be conceived as a function $A(D, f)$, $A: \mathcal{D} \times \mathcal{F} \to \mathcal{F}$, such that a new mapping $f^*$ is given. The iterations proceed until a stopping criterion $C(T, e, f, f^*)$, $C: \mathcal{T} \times \mathbb{N} \times \mathcal{F} \times \mathcal{F} \to \{0, 1\}$ becomes true (i.e. 1). Let $T$ be another set like $D$, the validation set, and $e$ a positive integer, the iteration index. We conceive the learning algorithm $L(D, T)$ as a function $L: \mathcal{D} \times \mathcal{T} \to \mathcal{F}$ which takes as input some dataset from the same generating process and returns the learned model $f$. This process is summarised in Algorithm 2, where $f^*$ is the model update.

Algorithm 2. Non-distributed learning algorithm $L(D, T)$.
1: Initialise with some $f^*$, $t \leftarrow 0$
2: repeat
3:   $t \leftarrow t + 1$
4:   Assign $f \leftarrow f^*$
5:   $f^* \leftarrow A(D, f)$
6: until $C(T, t, f, f^*)$
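For concreteness, the following is a direct Python rendering of Algorithm 2 for a generic update map A and stopping criterion C; the function signature mirrors the symbols in the text and is otherwise our own.

```python
def learn(D, T, A, C, f_init):
    """Algorithm 2: iterate the update map A until the stopping criterion C fires.
    A(D, f) returns the updated classifier; C(T, t, f, f_star) returns True (1) to stop."""
    f_star = f_init
    t = 0
    while True:
        t += 1
        f = f_star                 # step 4: keep the previous model
        f_star = A(D, f)           # step 5: risk computation and model update
        if C(T, t, f, f_star):     # step 6: stopping criterion on the validation set T
            return f_star
```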

However, in the distributed case the dataset is segregated over different machines, and the main problem lies in computing the update in step 5 of Algorithm 2 such that the empirical risk is reduced. The problem can be solved by determining at each iteration of $L$ the update $\Delta f$ of the classifier $f$ by the consensus Algorithm 1. Given a partition ${}^k D$, the empirical risk is

$R({}^k D, f) = \frac{1}{m_k} \sum_{i=1}^{m_k} Q({}^k x_i, {}^k y_i, f)$   (4)

and the empirical risk on $D$ is just the weighted mean over the empirical risk at each machine:

$R(D, f) = \frac{1}{m} \sum_{k=1}^{n} m_k R({}^k D, f)$   (5)

The updated classifier $f^*$ should usually be such that the empirical risk is reduced,

$R(D, f^*) - R(D, f) \leq 0$   (6)

but in the distributed setting this becomes

$R({}^k D, f_k^*) - R({}^k D, f_k) \leq 0$   (7)

where $f_k^*$ are the updates determined for each subset ${}^k D$. However, this does not imply Eq. (6). Instead the updates $f_k^*$ have to be consolidated such that the empirical risk over the entire dataset is reduced. This is in fact the principal difficulty of distributed learning. For this purpose, we employ consensus on the classifier updates $f_k^*$. Hence, the update step is

$f^* = \frac{1}{n} \sum_{k=1}^{n} f_k^*$   (8)

Its convergence is treated in Section 2.1; here we focus on describing the framework and signalling some practical difficulties. Let us define the consensus learning process $CL(D, T, q)$, Algorithm 3, as a modification of the learning process $L$, Algorithm 2, executing all the steps locally and adding one more step for the computation of the update by consensus. Theoretically, the update should be identical for all machines as a result of the consensus step 8. In step 6 of Algorithm 3, the update of $f$ at each machine is computed as if the entire dataset was locally available. Therefore, step 6 is equivalent to the update step in $L$ but not identical. The process is presented in Algorithm 3, where $f_k^*$ is the model update of the $k$th machine, and $f = (f_1, f_2, \ldots, f_n)$ is the vector of functions $f_1, f_2, \ldots, f_n$ found at each vertex of the graph.

Algorithm 3. Consensus learning algorithm $CL(D, T, q)$.
1: Initialise $f_k^*$, $\forall k \in \{1, 2, \ldots, n\}$
2: repeat
3:   $e \leftarrow e + 1$
4:   $f_k \leftarrow f_k^*$, $\forall k \in \{1, 2, \ldots, n\}$
5:   for $t = 1$ to $l$ do
6:     $f_k^* \leftarrow A({}^k D, f_k^*)$, $\forall k \in \{1, 2, \ldots, n\}$
7:   end for
8:   Run the consensus algorithm $f^* \leftarrow S(f^*, q)$
9:   $c_k \leftarrow C({}^k T, e, f_k, f_k^*)$, $\forall k \in \{1, 2, \ldots, n\}$
10:  $c \leftarrow \tilde{S}(c)$
11: until $c_k$, $\forall k \in \{1, 2, \ldots, n\}$
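A schematic rendering of Algorithm 3, under simplifying assumptions of ours: the models are parameter vectors, the consensus step S is simulated with a single global weight matrix W, and the boolean consensus S~ is replaced by a unanimous vote. All helper names are ours.

```python
import numpy as np

def consensus_learning(local_data, local_val, A, C, W, f_init, l=1, q=50, max_epochs=100):
    """Sketch of Algorithm 3 on n machines whose models are parameter vectors.
    A(kD, f) is the local update map, C(kT, e, f_old, f_new) the local stopping decision."""
    n = len(local_data)
    f_star = [f_init.copy() for _ in range(n)]
    for e in range(1, max_epochs + 1):
        f_prev = [f.copy() for f in f_star]                    # step 4
        for _ in range(l):                                     # steps 5-7: local learning phase
            f_star = [A(local_data[k], f_star[k]) for k in range(n)]
        F = np.stack(f_star)
        for _ in range(q):                                     # step 8: consensus phase S(f*, q)
            F = W @ F
        f_star = [F[k] for k in range(n)]
        stops = [C(local_val[k], e, f_prev[k], f_star[k]) for k in range(n)]
        if all(stops):                                         # steps 9-11: unanimous decision
            break
    return f_star
```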

There are two matters that need careful consideration for the practical application of Algorithm 3. First, the consensus process is slow in comparison to local computations. Second, the termination function might cause machines to stop asynchronously.

Unless the consensus phase has converged, which usually requires a large number of iterations, the estimates of the overall function $f_k^*$ at step 8 may differ among machines. Subsequently, these differences are propagated in the update step 6. Therefore, the update $f_k^*$ can be different between machines. Mainly, this depends on two factors, the update function $A$ and the precision attained at the consensus step 8. Another matter of consideration is that a learning process might require a large number of iterations to converge. Therefore, the entire process of a slow consensus step per single local learning step could become practically infeasible. However, this difficulty can be overcome if we execute the local learning process for more iterations, thus making fewer slow consensus steps overall. We show in Section 2.1 that convergence is guaranteed for any combination of learning and consensus iterations. However, the performance of the algorithm depends on their combination, among other matters, e.g. initialisation, learning step, etc. The performance of any machine learning algorithm is in general sensitive to such parameters. Therefore, the combination of local learning steps and iterations for consensus should be carefully designed case by case. Utilisation of the definitive consensus algorithm allows this problem to be partially overcome.

The second matter is the termination function $C$, which stops the algorithm. Even though in the non-distributed case this would be trivial, here it has some complications. As mentioned before, the updated functions $f_k^*$ may possibly differ. Therefore, the criteria at some machines may be valid for termination, while other machines may not have reached the same decision. Put simply, $C({}^k T, e, f_k, f_k^*) = C({}^l T, e, f_l, f_l^*)$ cannot be guaranteed for all $l \neq k$. In turn, this implies that not all machines terminate simultaneously. A solution is to run consensus on boolean decisions instead of scalar values. However, Algorithm 1 is not fit for this purpose. Either majority voting or a unanimous decision can be employed. The choice of the method depends on many factors, but most importantly on the application at hand. Recent advancements in the field solve this problem for the case of majority voting when the number of machines is finite, see [12,13]. However, the discussion of this algorithm is beyond our purpose. We denote the consensus algorithm on boolean decisions as $\tilde{S}(x)$ in Algorithm 3.

Each iteration of Algorithm 3 is referred to as an epoch $e$. At each epoch we distinguish three phases: the local learning phase, the global learning phase, and the termination phase. In the first phase the empirical risk is computed and the classifiers are updated locally for $l$ iterations. In the global learning phase, the risk over the entire dataset is estimated by running consensus for $q$ iterations. The final phase consists of computing the stopping criteria locally. Then termination can be decided by consensus on the decision value. The last step allows all machines to terminate simultaneously.


2.1. Convergence

We have performed a large number of simulations which show that in general distributed learning by consensus gives good results, usually of similar quality as centralised learning, where all data is processed at a central site. In a number of cases we can also justify this theoretically.

Case 1: The local learning rule $A$ is a contraction for any set of learning data $D$:

$\|A(D, f) - A(D, g)\| \leq \mu_D \|f - g\|, \quad \mu_D < 1$   (9)

for any pair of admissible classifiers $f, g \in \mathcal{F}$. This is e.g. the case of the Adaline learning rule. Eq. (9) holds also when $A$ is a step in an optimisation algorithm for a smooth objective function. In this case, however, $f$ and $g$ must be restricted to the immediate basin of attraction of the same minimum of the objective function, and when combined with a consensus step in distributed learning, care needs to be taken that the classifier remains in this basin. The consensus algorithm acts on the vector $(f_1, f_2, \ldots, f_n) \in \mathcal{F}^n$ of classifiers:

$f_k^* = \sum_{j=1}^{n} w_{kj} f_j \quad \text{with} \quad \sum_{j=1}^{n} w_{kj} = 1, \; \forall k$   (10)

On $\mathcal{F}^n$ we use the max-norm over the norm in $\mathcal{F}$:

$\|(f_1, f_2, \ldots, f_n)\| = \max_k \|f_k\|$   (11)

Proposition 1. If all coefficients $w_{kj}$ of the consensus algorithm are non-negative, we get

$\|(f_1^*, f_2^*, \ldots, f_n^*)\| \leq \|(f_1, f_2, \ldots, f_n)\|$   (12)

where $f_k^*$ is given by Eq. (10).

Proof.

$\|(f_1^*, f_2^*, \ldots, f_n^*)\| = \max_k \|f_k^*\| = \max_k \left\| \sum_{j=1}^{n} w_{kj} f_j \right\| \leq \max_k \left( \sum_{j=1}^{n} w_{kj} \|f_j\| \right) \leq \max_k \left[ \left( \sum_{j=1}^{n} w_{kj} \right) \max_j \|f_j\| \right] = \max_k \left( \max_j \|f_j\| \right) = \max_j \|f_j\| = \|(f_1, f_2, \ldots, f_n)\|$   (13)

□

Often, in a single step of the consensus algorithm, $w_{kj} \geq 0$ for all $k, j$. If we aggregate several steps, non-negative weights are even more likely, since ultimately the weights must converge to $1/n$. We now combine local learning and consensus to a global learning step

$f_k^* = \sum_{j=1}^{n} w_{kj} A({}^j D, f_j)$   (14)

Theorem 1. If the local learning rule satisfies Eq. (9) and the consensus step has non-negative coefficients, then the global learning step is a contraction in $\mathcal{F}^n$, and its iteration converges to a unique classifier that is independent of the initialisation of the learning process.

Proof. By Proposition 1 applied to Eq. (14),

$\|(f_1^* - g_1^*), (f_2^* - g_2^*), \ldots, (f_n^* - g_n^*)\| \leq \|(A({}^1 D, f_1) - A({}^1 D, g_1)), \ldots, (A({}^n D, f_n) - A({}^n D, g_n))\| = \max_k \|A({}^k D, f_k) - A({}^k D, g_k)\|$   (15)

Employing Eq. (9), we get

$\|(f_1^* - g_1^*), (f_2^* - g_2^*), \ldots, (f_n^* - g_n^*)\| \leq \max_k \mu_{{}^k D} \|f_k - g_k\|$   (16)

$\leq \left( \max_k \mu_{{}^k D} \right) \left( \max_k \|f_k - g_k\| \right) = \mu \|(f_1 - g_1), (f_2 - g_2), \ldots, (f_n - g_n)\|$   (17)

where

$\mu = \max_k \mu_{{}^k D} < 1$   (18)

Convergence to a unique classifier then follows from the contraction principle. □
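A small numerical illustration of Theorem 1, under assumptions of ours: the local rule A is one batch gradient step on a local least-squares risk (an Adaline-like rule, a contraction for a small enough learning rate), and the aggregated consensus weights are the plain average. Two different initialisations are driven to the same classifier by the global learning step of Eq. (14).

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, mk = 4, 3, 30                       # machines, parameters, samples per machine
X = [rng.normal(size=(mk, p)) for _ in range(n)]
y = [Xk @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=mk) for Xk in X]
W = np.full((n, n), 1.0 / n)              # aggregated consensus weights: non-negative, rows sum to 1
eta = 0.1                                 # small enough for the local gradient step to be a contraction

def A(k, f):
    """Local learning rule: one batch gradient step on the local squared-error risk."""
    return f - eta * X[k].T @ (X[k] @ f - y[k]) / mk

def global_step(F):
    """Eq. (14): a local update at every machine followed by the consensus combination."""
    local = np.stack([A(k, F[k]) for k in range(n)])
    return W @ local

F1 = np.zeros((n, p))
F2 = rng.normal(size=(n, p))              # a second, different initialisation
for _ in range(500):
    F1, F2 = global_step(F1), global_step(F2)
print(np.max(np.abs(F1 - F2)))            # close to 0: the iteration reaches a unique classifier
```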

Remark 1. Here, $A$ is understood to be the aggregate of the elementary local learning steps, and the weight matrix of the consensus step is likewise understood to be the aggregate of the elementary consensus steps. This makes it easier to justify the hypotheses, in particular the non-negativity of the consensus coefficients. Specifically, if a definitive consensus algorithm is applied, this condition is automatically satisfied.

The classifier that is attained when the hypotheses of Theorem 1 are satisfied depends both on the local learning rule $A$ and the consensus matrix $[w_{ij}]$. In the special case when many elementary steps of local learning are aggregated into $A$, then $A$ produces the classifiers $f_k$ that would result when only the data ${}^k D$ were available. Combining these with the consensus step, and iterating, the classifier

$f_k = \sum_{j=1}^{n} w_{kj} f_j$   (19)

is reached asymptotically. If the consensus step is also aggregated from many elementary consensus steps, as when definitive consensus is used, then Eq. (19) becomes

$f_k = \frac{1}{n} \sum_{j=1}^{n} f_j$   (20)

Case 2: Another case of interest is when local learning corresponds to empirical risk minimisation by gradient descent and definitive consensus (as a large number of elementary consensus steps) is used. In this case, distributed learning and centralised learning give the same result. In order to formulate this, we represent the set $\mathcal{F}$ of admissible classifiers by $p$ real parameters $a_1, \ldots, a_p$; instead of $f$ we shall write $f_{(a_1, \ldots, a_p)}$ and instead of $f_k$, $f_{({}^k a_1, \ldots, {}^k a_p)}$. The empirical risk is

$R(D, a_1, \ldots, a_p) = \frac{1}{m} \sum_{i=1}^{m} Q(x_i, y_i, a_1, \ldots, a_p)$   (21)

where $Q$ is the loss function. It decomposes into local empirical risks

$R(D, a_1, \ldots, a_p) = \sum_{k=1}^{n} \frac{m_k}{m} \frac{1}{m_k} \sum_{j=1}^{m_k} Q({}^k x_j, {}^k y_j, {}^k a_1, \ldots, {}^k a_p) = \sum_{k=1}^{n} \frac{m_k}{m} R({}^k D, a_1, \ldots, a_p)$   (22)

The local learning rule, when starting from identical classifiers $f_{({}^k a_1, \ldots, {}^k a_p)} = f_{(a_1, \ldots, a_p)}$ obtained by a previous consensus, is

$f_{({}^k a_1^*, \ldots, {}^k a_p^*)} = f_{(a_1, \ldots, a_p)} - \eta \sum_{q=1}^{p} \frac{\partial}{\partial a_q} R({}^k D, a_1, \ldots, a_p)$   (23)

Now we apply consensus with coefficients $m_k/m$, i.e. iteration of the elementary consensus step or definitive consensus leads from an initial state $x(0)$ to $\left( \sum_{k=1}^{n} (m_k/m) x_k(0) \right) \mathbf{1}$, where $\mathbf{1}$ is the vector with all components equal to 1. This can be achieved with non-symmetric elementary consensus matrices with left eigenvector $v = (m_1/m, m_2/m, \ldots, m_n/m)^T$, or with definitive consensus with a finite number of matrices whose product is $\mathbf{1} v^T$. This leads to

$f_{(a_1^*, \ldots, a_p^*)} = \sum_{k=1}^{n} \frac{m_k}{m} f_{({}^k a_1^*, \ldots, {}^k a_p^*)}$   (24)

$= f_{(a_1, \ldots, a_p)} - \eta \sum_{k=1}^{n} \frac{m_k}{m} \sum_{q=1}^{p} \frac{\partial}{\partial a_q} R({}^k D, a_1, \ldots, a_p)$   (25)

which is exactly the formula for centralised gradient descent. This can be formulated as a theorem.

Theorem 2. If the local learning step is obtained by gradient descent of the local empirical risk, and in the consensus step the linear combination of the local classifiers with the coefficients $m_k/m$ is reached, where $m_k$ is the amount of data at location $k$ and $m$ is the total amount of data, then the combination of a local learning step and a consensus step is identical to a gradient descent step of the global empirical risk.

Remark 2. If at all locations the same amount of learning data is available, then the consensus step is the usual average consensus.

Remark 3. Theorem 2 can be generalised to local learning algorithms where the update of the classifier is the mean value of the updates for each learning sample, and this elementary update is defined through the loss function $Q$.
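A minimal numerical check of Theorem 2, assuming a squared loss and a synthetic dataset of ours: one local gradient step per machine followed by the $m_k/m$-weighted combination of Eqs. (24)-(25) coincides, up to floating point, with a single centralised gradient step on the pooled data.

```python
import numpy as np

rng = np.random.default_rng(1)
p, eta = 3, 0.1
m_k = [20, 35, 45]                                  # unequal amounts of data per machine
X = [rng.normal(size=(mk, p)) for mk in m_k]
y = [rng.normal(size=mk) for mk in m_k]
m = sum(m_k)
a = rng.normal(size=p)                              # common parameters after a previous consensus

def local_grad(Xk, yk, a):
    """Gradient of the local empirical risk R(kD, a) with loss Q = (x^T a - y)^2 / 2."""
    return Xk.T @ (Xk @ a - yk) / len(yk)

# Distributed step: local gradient descent, then the m_k/m-weighted combination.
a_local = [a - eta * local_grad(X[k], y[k], a) for k in range(len(m_k))]
a_distributed = sum((mk / m) * ak for mk, ak in zip(m_k, a_local))

# Centralised step: one gradient step on the global empirical risk over the pooled data.
X_all, y_all = np.vstack(X), np.concatenate(y)
a_centralised = a - eta * X_all.T @ (X_all @ a - y_all) / m
print(np.max(np.abs(a_distributed - a_centralised)))  # ~1e-16: the two steps coincide
```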

3. Application to feed forward neural networks

Our purpose in this section is to demonstrate that this approach is applicable. To this end, we have chosen an elementary machine learning algorithm. Moreover, we want to verify the theoretical result that the distributed algorithm converges as the non-distributed counterpart does. Finally, our interest is focused on exploring effects of the distributed learning process which may not have been covered by the theoretical analysis. We believe that an elementary machine learning algorithm is not hindered by algorithmic complexity, and allows us to effortlessly understand, observe, and analyse the distributed learning process. This application falls within Case 2 of Section 2.1.

A description of this type of neural network is given in a number of books, with the most prevalent being [14]. We remind the reader that the inner product of the parameters $a_j^l$ and the input vector $u_j^l$ of the $j$th neuron at the $l$th layer is passed through a nonlinear activation function $g$. Each neuron on the network realises a function $h(x_j^l; a_j^l, b) = g((u_j^l)^T a_j^l) + b$. The learning process of the neural network is governed by the following set of equations, referred to as the back-propagation rule. The update of the synaptic coefficients $a_j^l$ upon presentation of a learning sample $(x_i, y_i)$ is given by

$\delta_j = g'(u^T a_j) \frac{\partial Q(x_i, y_i, h)}{\partial h}$   (26)

$\delta_j^{l-1} = g'((u_j^{l-1})^T a_j^{l-1}) \sum_q a_{qj}^l \delta_q^l$   (27)

$\Delta a_{jq}^l = -\eta \delta_j^l u_{jq}^l$   (28)

where $\delta_q^l$ is the error contribution of the $q$th neuron at the $l$th layer, $g'$ is the derivative of the activation function, $\Delta a_{jq}^l$ is the parameter adjustment, and $\eta$ is a small constant, the learning rate. The first equation defines the error contribution of the last layer's output, before the output neuron. The second equation defines the error contribution of the $j$th neuron of the layer indexed $l-1$ by back-propagating the error contributions of the neurons at the $l$th layer. The summation is done over the parameters of the input connections $a_{qj}^{l-1}$ of the $j$th neuron on the layer indexed $(l-1)$. Then Eq. (28) gives the parameter update of every connection on the network. The reader is directed to [14] for a concise presentation of the algorithm.


In the setting of distributed consensus, it is better to consider batch updates:

$\delta = g'(u^T a) \sum_{i=1}^{m} \frac{\partial Q(x_i, y_i, h)}{\partial h}$   (29)

In view of the partitioned dataset and batch updates, the error of the last layer before the output neuron is modified as follows:

$\delta = g'(u^T a) \sum_{k=1}^{n} \sum_{i=1}^{m_k} \frac{\partial Q({}^k x_i, {}^k y_i, h)}{\partial h}$   (30)

At each $k$th machine the error is evidently

$[\delta]_k = g'(u^T a) \sum_{i=1}^{m_k} \frac{\partial Q({}^k x_i, {}^k y_i, h)}{\partial h}$   (31)

and $\delta = \sum_{k=1}^{n} [\delta]_k$. Therefore the parameter update is just the sum of the partial weight updates determined at each machine. The computation of the parameter update rule (Eq. (32)) is straightforward in a distributed fashion by consensus, as in Algorithm 1. The consensus update equation is given in Eq. (33), where $[\Delta a_{jh}^l]_k$ is the parameter update at the $k$th machine.

$\Delta a_{jh}^l = -\eta \sum_{k=1}^{n} [\Delta a_{jh}^l]_k$   (32)

$[\Delta a_{jh}^l]_i \leftarrow \sum_{k=1}^{n} w_{ik} [\Delta a_{jh}^l]_k$   (33)
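The sketch below illustrates Eqs. (31)-(33) for the simplest possible case, a single sigmoid unit with squared loss, which is our own reduction of the network for brevity. Each machine computes its partial batch update $[\Delta a]_k$ from its local data, consensus averages these partial updates, and the average is rescaled by $n$ to recover the sum in Eq. (32) (equivalently, the factor could be absorbed into $\eta$); that rescaling convention is ours.

```python
import numpy as np

rng = np.random.default_rng(2)
n_machines, d = 4, 3
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def local_partial_update(Xk, yk, a):
    """[Delta a]_k of Eqs. (31)-(32) for a single sigmoid unit h(x) = g(x^T a) with
    squared loss: the local batch sum of delta_i * x_i, cf. Eqs. (26) and (28)."""
    h = sigmoid(Xk @ a)
    delta = h * (1.0 - h) * (h - yk)     # g'(u^T a) * dQ/dh, per local example
    return Xk.T @ delta

# Each machine holds its own batch; all start from the same parameters a.
X = [rng.normal(size=(25, d)) for _ in range(n_machines)]
y = [rng.integers(0, 2, size=25).astype(float) for _ in range(n_machines)]
a = rng.normal(size=d)
partial = np.stack([local_partial_update(X[k], y[k], a) for k in range(n_machines)])

W = np.full((n_machines, n_machines), 1.0 / n_machines)  # toy consensus weights (complete graph)
for _ in range(30):                                       # Eq. (33): consensus on the partial updates
    partial = W @ partial

eta = 0.05
delta_a = -eta * n_machines * partial[0]   # Eq. (32): n times the consensus mean recovers the sum
print(a + delta_a)                         # the identical updated parameters at every machine
```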


3.1. Early stopping

Early stopping is used to terminate the algorithm. This is mainly due to the difficulty of communicating a decreasing learning rate throughout the network. Additionally, early stopping provides a simple method to avoid over-fitting. In the framework presented in Algorithm 3, we can directly implement it as the termination criterion. We can employ early stopping, with the termination criterion in Eq. (34), in a distributed fashion by validating the learned network after each epoch on another dataset (the validation set) locally:

$C(T, e, a_k, a_k^*) = \begin{cases} 0 & \text{if } R(T, f(a_k^*)) - R(T, f(a_k)) \leq 0 \\ 1 & \text{if } R(T, f(a_k^*)) - R(T, f(a_k)) > 0 \end{cases}$   (34)

As described in Section 2, due to discrepancies in the computation of consensus, the termination decisions may differ. To overcome this, we run consensus on the local decisions $\{0, 1\}$ taken at each machine, as in Algorithm 3. In Eq. (34) it has been implied that the validation set is the same for every machine. However, it is likely that the validation set is also partitioned, $T = \bigcup_{k=1}^{n} {}^k T$. Then Eq. (34) can no longer be applied as is. Instead, we compute the differences locally,

$\Delta {}^k R = R({}^k T, f(a_k^*)) - R({}^k T, f(a_k))$   (35)

and augment the aforementioned process by running one more consensus step to determine the difference over the entire validation set. Alternatively, consensus should be executed in order to agree on the local decisions. Otherwise, the machines might not terminate simultaneously.
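As an illustration of the termination phase only, the sketch below approximates a majority vote by averaging the local 0/1 decisions of Eq. (34) with plain consensus and thresholding the common average at 1/2. The paper notes that Algorithm 1 is not ideal for boolean decisions and points to quantised/interval consensus [12,13] for an exact mechanism; the averaging shortcut and the helper names here are ours.

```python
import numpy as np

def local_decision(R_val_old, R_val_new):
    """Eq. (34): vote 1 (stop) if the local validation risk did not improve."""
    return 1.0 if R_val_new - R_val_old > 0 else 0.0

def agree_to_stop(decisions, W, q=50):
    """Average the 0/1 votes by consensus; every machine applies the same majority threshold."""
    c = np.asarray(decisions, dtype=float)
    for _ in range(q):
        c = W @ c
    return c > 0.5        # the same boolean at every machine once consensus has converged

# Toy run: 5 machines, 3 of which saw no improvement on their local validation set.
n = 5
W = np.full((n, n), 1.0 / n)
votes = [local_decision(0.30, r) for r in (0.31, 0.29, 0.33, 0.28, 0.32)]
print(agree_to_stop(votes, W))   # all True: the majority votes to stop, so every machine stops
```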

4. Numerical verification

We have initially tested our results on a variant of the so-called two-moons dataset [15]. The generation is performed by uniformly sampling points along a circle of radius 1. The points from the upper half of the circle are displaced vertically and horizontally, and then Gaussian noise is added in the vertical direction. The vertical displacement, horizontal displacement, and standard deviation of the noise were 0.9, 0.5 and 0.1, respectively. The upper half has been labelled as class 0 and the lower half as class 1 (Fig. 1).

4.1. Example of classification

We have selected a neural network of 7 neurons placed in two hidden layers [5 2]. The sigmoid function was used as the activation function of the neurons in these two layers. The input and output layers had linear functions. For the purpose of the experiment, we have generated 50 datasets, with 80 examples per class. Each dataset has been divided into 10 parts, one for each machine on the network. The points have been sampled uniformly such that each partition contains examples from both classes. The dataset has been locally partitioned into training and validation sets for the purpose of locally training the neural networks. Thereafter, the partitions were allocated to the vertices for each of the 50 datasets. Additionally, one dataset, $T$, was generated for the purpose of measuring performance, given by Eq. (3). The partitions have then been allocated on the vertices of the communication graph,

Fig. 1. Example of the two moons dataset. From left to right: in (a) an example of the two moons dataset is shown; it has two classes, the top (red) being class 0 and the bottom (blue) class 1. In (b), the partitions occurring with uniform sampling are presented. Different shape, size and colour of the points indicate different partitions. The given example consists of 10 partitions. All partitions have examples from both classes. (For interpretation of the references to colour in this figure caption, the reader is referred to the web version of this paper.)



and then Algorithm 3 has been executed. For the purpose of comparing performance, a neural network for each of the 50 datasets has been trained on the entire training set. Moreover, training has been performed with the same initialisation and training parameters in all networks and in all three cases: consensus, non-distributed, and local learning. Overall, the results verify the theoretical result that the algorithm converges. An exemplary case of the output is shown in Fig. 2. The first sub-figure (a) in Fig. 2 exhibits the output of a specific machine in the communication graph after having been trained with the consensus machine learning algorithm. This is roughly equivalent to the output of any other machine on the network, since all the classifiers after consensus are roughly identical. The specific machine is designated on the communication graph (f) with a red circle. Second, the output of an identical neural network that has been trained with the entire dataset is displayed in (b). Third, the output of the same network trained only on the local subset belonging to the designated machine is displayed in (c).

For the rest of this paper, we employ the definitive consensus algorithm. In some cases we employ regularisation on the risk function of the neural network. As a regularisation function in this sequence of experiments, we have used the mean of the square of the learning parameters, i.e. the neural network weights. The empirical risk is modified to include this as follows:

$r_k = R({}^k D, f_k) + (1 - \zeta) \frac{1}{m_a} \sum_{j=1}^{m_a} [a_j]_k^2$   (36)

where $m_a$ is the number of weights in the entire neural network, and $j$ is some indexing of the weights throughout the entire neural network. The outer index $k$ indicates the machine.

In Fig. 3, we use definitive consensus learning. The classification outputs of the networks trained only with the local dataset (c) are somewhat in accordance with those obtained by training with the global dataset by consensus, sub-figure (b). The results of training with consensus learning (a) are better than those of training directly with the global dataset. The better performance of Algorithm 3 in comparison with directly learning on the entire dataset is justified by the fact that the local learning iterations are more than one, as in the theory, which in fact complicates the learning process. We have attained such results in numerous cases, and the matter is an interesting research direction. However, in this paper we just aim at providing a proof of concept for the general case. We provide Fig. 4 for the case where the local learning phase consists of only one iteration. Finally, the convergence properties of Algorithm 3 may be verified in Fig. 5. We plot a few of the dataset instantiations and the classification error rate at the end of each epoch. For comparison we have plotted the output on the test set of having trained with only one subset of the data for the same number of learning iterations.

4.2. Test on the 2007 TREC public spam corpus

The method was tested on a real world dataset, the TREC spam corpus, available at http://plg.uwaterloo.ca/gvcormac/treccorpus07/. We demonstrate that the method converges on large datasets, and that the distributed algorithm achieves the performance of its non-distributed counterpart. Two issues need to be resolved: first, which learning algorithm to employ, and second, how to pre-process the corpus. We address both these matters by selecting a simple approach. In contrast, a sophisticated method might conceal deficiencies of distributed machine learning by consensus, possibly by compensating for errors in the local update step. For the learning algorithm we chose the neural network with back-propagation, presented in Section 3, and a bag-of-words approach for data reduction, detailed in Section 4.2.1 below.


Fig. 2. Classification output with learning by consensus. (a) The output of the distributively trained network on the Test set, cutoff=0.5, datasetid:9. (b) The output of the non-distributively trained network, cutoff=0.5, datasetid:9. (c) The output by training with just the local dataset, cutoff=0.5, datasetid:9. (d) The true Test set, datasetid:9. (e) The segregation of the training dataset. Each combination of colour and figure designates a different partition. The data-points for the selected dataset are specified with a blue star, datasetid:9. (f) The communication graph. The red circle designates the selected node for which the results are shown. (For interpretation of the references to color in this figure caption, the reader is referred to the web version of this paper.)



Fig. 3. Classification output on the test set with learning by definitive consensus learning, without regularisation. Exactly the same results are obtained for both the centralised (b) and local learning (c) cases. The local learning phase iterations were set to l = 100. The algorithm stopped at epoch 27 by employing early stopping on the local test sets. Details as in Fig. 2. (a) Network output, trained with consensus, cutoff=0.5, datasetid:11, (b) Network output, trained with the entire dataset, cutoff=0.5, datasetid:11, (c) Network output, trained with the local dataset, cutoff=0.5, datasetid:11, (d) True classes, test set, datasetid:11, (e) Training dataset segregation, datasetid:11 and (f) Graph.


Fig. 4. Classification without differential learning and with regularisation ζ = 0.8 (uniform partitioning). Exactly the same results are obtained for both the centralised and distributed algorithms when the local learning phase has a duration of only one iteration of the learning algorithm. The number of epochs has been increased to 200. Details as in Fig. 2. (a) Network output, trained with consensus, cutoff=0.5, datasetid:4, (b) Network output, trained with the entire dataset, cutoff=0.5, datasetid:4, (c) Network output, trained with the local dataset, cutoff=0.5, datasetid:4, (d) True classes, test set, datasetid:4, (e) Training dataset segregation, datasetid:4 and (f) Graph.

4.2.1. Representation and data reduction

In the bag-of-words approach each e-mail is represented as a vector of frequencies of tokens in the corpus. That is, the frequency of the $j$th token in the $i$th email of the $k$th partition of the dataset is

${}^k x_{ij} = \frac{{}^k \nu_{ij}}{N_j}$   (37)

where ${}^k \nu_{ij}$ is the number of occurrences of the $j$th token in the $i$th email of the $k$th partition, and $N_j$ is the total number of occurrences of the $j$th token in the entire corpus. We detail the pre-processing for the extraction of frequencies in Section 4.2.2. The dimension of the frequency vectors is equal to the number of distinct tokens in the corpus. Specifically, more than 155 000 distinct tokens were detected in the corpus. However, this violates memory limitations imposed by Matlab. In order to address them, we reduced the frequency vector's dimension as follows. The tokens have been sorted with respect to the total number of occurrences in the corpus. Then we specified three regions from this sorted list, head, middle, and tail, representing respectively very frequent, less frequent, and least frequent tokens. The head and the tail have been designated as the first and last 10 000 tokens, respectively, and the middle section as the region 5000 tokens before and after the middle of the sorted list. Subsequently, we have randomly sampled 1000 tokens from each section, totalling 3000 tokens. This resulted in a dataset of 75 419 × 3000, i.e. 75 419 frequency vectors of dimension 3000. Finally, we have randomly chosen 30 000 frequency vectors out of the total 75 419 to form the reduced dataset employed in our tests. Our approach may have reduced the amount of information in the original corpus. However, the reduced dataset is employed both for the distributed and the non-distributed case; thus the comparison between the two is fair.

Fig. 5. Convergence of learning by definitive consensus. The red curve depicts the classification error rate at each epoch of the algorithm. The line is the classification error rate with just the local dataset. The results depict the output of one machine on the network. The output for a few of the different instantiations of the two moon datasets are shown. Dataset #11 in our experiments corresponds to Fig. 3. (For interpretation of the references to color in this figure caption, the reader is referred to the web version of this paper.)

4.2.2. Pre-processing and preparation of the corpus

The corpus consists of 75 419 emails, labelled as spam or ham [16]. These are stored as individual ASCII text files inclusive of email headers. We first extract tokens from each email by employing bogolexer, available at http://bogofilter.sourceforge.net/, which results in a list of tokens for each email and the number of occurrences ${}^k \nu_{ij}$ of each token in the email. Each email is represented as a list of [token, occurrences] pairs. Each new distinct token encountered in the corpus is added to a token dictionary, a global list of [token, total occurrences] pairs, where total occurrences is $N_j$. The frequency of each token in an email can then be computed by virtue of Eq. (37). This results in the representation of each email as a vector of token frequencies, with dimension equal to the total number of tokens in the dictionary. This concludes the pre-processing step.

This process has been performed centrally for the purpose of this test. We have not gone to the extent of implementing the pre-processing distributively. It is sufficient to provide the pre-processing output, the frequency vector representation of each email, to both the distributed and the non-distributed learning machine, in order to justify a fair comparison. However, one could argue that it is impractical to centrally pre-process the corpus and then distribute the result to perform learning on it. Hence, it is of interest to show that pre-processing can be performed distributively to produce an identical outcome. We outline a pre-processing procedure that can be easily implemented in a fully distributed and decentralised fashion. This can be achieved by employing a hash-table and the consensus protocol of Algorithm 1. Two issues need to be considered. First, the addition of new tokens to the dictionary. Second, the computation of the total occurrences in the entire corpus for each token. The former can be addressed by using a hash function to retrieve a hash of the token and map it to the appropriate index of the frequency vector: the hash is that index. Hence, each email is now represented as a list of [hash, occurrences] pairs. Tokens that are not present in the email have zero occurrences, and need not be considered. The second issue, to compute the total occurrences $N_j$ for each token, can be addressed by employing the consensus algorithm (Algorithm 1) on the local sum of occurrences of the $j$th token at the $k$th partition,

${}^k N_j = \sum_{i=1}^{m_k} {}^k \nu_{ij}$   (38)

Only [hash, occurrences] pairs need be exchanged at each step of the consensus algorithm. The latter does not imply an exchange of datapoints, and does not expose the local data, thus preserving data privacy. Moreover, exchange of [hash, occurrences] pairs is a privacy preserving mechanism, depending on the choice of the hash function. Finally, what is stored at each node is the global dictionary as in the case of centralised pre-processing, a list of [hash, total occurrences] pairs, and the lists of [hash, occurrences] pairs corresponding to the emails in the local partition of the corpus. Trivially, each email representation can be transformed from a list of [hash, occurrences] pairs to a list of [hash, frequency] pairs. The tokens not present in the local data set need not be considered. Distributed pre-processing results in an identical representation as central pre-processing.
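To make the procedure concrete, the sketch below computes the global totals $N_j$ of Eq. (38) over hashed token indices: each machine builds a local count vector over a fixed hash range, consensus yields the average, and multiplying by the number of machines recovers the global sums. A dense count vector stands in for the paper's sparse [hash, occurrences] lists, and the hash range and helper names are assumptions of this sketch.

```python
import hashlib
import numpy as np

HASH_DIM = 2 ** 20   # fixed hash range shared by all machines (an assumption of this sketch)

def token_index(token):
    """Map a token to a frequency-vector index via a stable hash."""
    return int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % HASH_DIM

def local_counts(emails):
    """Local sum of occurrences per hashed token, i.e. kN_j of Eq. (38)."""
    counts = np.zeros(HASH_DIM)
    for email in emails:
        for token in email.split():
            counts[token_index(token)] += 1
    return counts

# Two machines, each holding its own partition of a tiny toy corpus.
partitions = [["cheap pills now", "meeting at noon"],
              ["cheap cheap offer", "project meeting notes"]]
C = np.stack([local_counts(part) for part in partitions])

n = len(partitions)
W = np.full((n, n), 1.0 / n)       # toy consensus weights for two connected machines
for _ in range(20):                # Algorithm 1 applied component-wise to the count vectors
    C = W @ C
N = n * C[0]                       # global totals N_j: n times the consensus average
print(N[token_index("cheap")])     # 3.0 occurrences of "cheap" (barring hash collisions)
```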

4.2.3. Training and results

The reduced dataset has been employed on a simulated network of 10 connected machines. The dataset was balanced to


contain the same number of examples from both classes. The generated dataset of 30 000 vectors has been separated into a 2/3 training set and a 1/3 test set. The training set has been partitioned into 10 subsets. Then each of these 10 subsets has been partitioned into training and validation parts (2/3 training set). These 10 subsets have been used for the simulation of the consensus machine learning process in Matlab. We have used a variety of different communication network topologies; however, neither convergence nor the results seemed to be affected. We have also explored differently sized neural networks (the local learning machines) and different neural network architectures. We exhibit two cases of our experiments. The first case, in Fig. 6, is a simple three layer neural network with one hidden layer, having 10 neurons on the first layer, 50 on the second and 10 on the third. We denote such an architecture as [10 50 10]. The communication network topology of the machines is quite uniform with only two branches, thus having central nodes, which may increase the precision of the consensus step. Subsequently, we present a neural network with architecture [12 20 30 10] in Fig. 7. The communication network's topology is somewhat more complex in comparison with the previous one, having two clusters and two cycles. In all examples the distributed data inference converges just as the non-distributed one. The classification error rate is not sufficiently low, but comparison with the non-distributed case reveals that it performs similarly with these simple neural network architectures. Nevertheless, in all figures the reduction of the classification error rate and of the mean square error is evident. Better results may be obtained on this corpus with more sophisticated pre-processing of the given dataset.


4.3. Discussion

According to the theory in Section 2.1, convergence can be guaranteed when the learning algorithm imposes a contraction on the learning data. In the specific case of gradient descent we have shown that taking subsets of data of equal size and a consolidation step that is the arithmetic mean of the parameter vectors is sufficient to guarantee convergence. This is exactly what happens when the definitive consensus algorithm is used. In the case of the consensus learning algorithm, convergence is still guaranteed, but its efficacy is a matter of the combination between local learning and global learning iterations. These theoretical conclusions are supported by the results in Section 4.1. Convergence is evident in Fig. 5 for the two-moons dataset, and in Figs. 6 and 7 for the TREC spam corpus. Even though in the latter the classification results are weak, convergence is still demonstrated. Nonetheless, the weak classification results are mostly an issue of pre-processing and the classifier, and they are present in the centralised case as well. Another local learning algorithm may be used in this case to obtain better results. However, our purpose has been to show convergence on a real world dataset; hence, we have not yet researched this direction. In contrast, the results obtained on the two-moons dataset are often superior to those obtained by employing the same centralised machine learning. This is more evident in the case where we let the local learning phase execute for l > 1. When l = 1 the results are near identical, which further supports the results in Section 2.1. We believe that the effect of l is related to many factors, among them sparsity of the data, the partitioning process, and the dataset itself. It remains a parameter to tune during learning.

Fig. 6. Convergence on the reduced TREC spam corpus with a simple neural network. The mean squared error (MSE) after each epoch is shown. Results are shown on the test set for a neural network of [10 50 10] neurons. Early stopping with 6 validation checks has been used. The regularisation parameter was set at ζ = 0.80. The MSE and CER of the non-distributed case were 0.18 and 0.31, respectively. The convergence of the distributed algorithm can be verified. Details: (a) The mean squared error (MSE). (b) The classification error rate (CER). (c) The communication graph.

Fig. 7. Convergence on the reduced TREC spam corpus with a larger neural network. The convergence of the consensus machine learning framework on the reduced TREC spam corpus can be evaluated by examination of the loss function during training. Results are shown on the test set for a neural network of [12 20 30 10] neurons. Early stopping with 6 validation checks has been used. The regularisation parameter was set at ζ = 0.90. The MSE and CER of the non-distributed case were 0.19 and 0.30, respectively. The sub-figures demonstrate the convergence of the distributed algorithm. Particularly, (a) The mean squared error (MSE). (b) The classification error rate (CER). (c) The communication graph.



The success of the distributed learning algorithm is not related to the communication graph G. This has been verified both theoretically and numerically by simulations. One has to choose between using the simple consensus or the definitive consensus algorithm. The first is appropriate for larger communication networks or distributed computing architectures, where the coefficients for the definitive consensus algorithm are hard to compute. There is also the case where the topology of the network is unknown, as in swarm robotics or other collaborative applications; there the simple consensus algorithm should be employed. In all other cases the definitive consensus algorithm is more appropriate.

Finally, a number of related research directions are of interest. Some are of applied interest, such as obtaining the equations for the application of other machine learning algorithms; others are of theoretical interest, such as the relation between data sparsity, partitioning, learning rate, and local learning iterations, or the case of stochastic gradient descent and other variants. These matters are of interest because they would increase the efficacy and applicability of the distributed learning algorithm by consensus.

5. Conclusion

We have presented an algorithm for performing machine learning distributively by incorporation of the consensus algorithm. The principal contribution is the proof of convergence of distributed machine learning under mild assumptions; these are realistic in many applications and machine learning algorithms. Furthermore, this framework has been specified for the multilayer feed-forward neural network with back-propagation. We have used the latter to verify the theoretical results. Importantly, during the distributed process there is no exchange of data; only the learned quantities are shared between neighbouring machines. Moreover, the effort required at each machine is reduced in comparison to having to process the entire dataset, although the total effort may be larger. In contrast, the time to compute is reduced nearly linearly in the number of machines in the network, since the computation on the local datasets is performed in parallel. These properties are favourable in many applications due to privacy considerations, communication, energy, and computation costs. They outline the cases where the algorithm presented in this paper for distributed machine learning can be applied.

References

[1] J.N. Tsitsiklis, Problems in Decentralized Decision Making and Computation, Ph.D. Thesis, Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 1984.
[2] L. Xiao, S. Boyd, S.-J. Kim, Distributed average consensus with least-mean-square deviation, J. Parallel Distrib. Comput. 67 (2007) 33–46.
[3] R. Olfati-Saber, J.A. Fax, R.M. Murray, Consensus and cooperation in networked multi-agent systems, in: Proceedings of the IEEE, 2007.
[4] L. Georgopoulos, Definitive Consensus for Distributed Data Inference, Ph.D. Thesis, École Polytechnique Fédérale de Lausanne, 2011.
[5] M. Rabbat, R. Nowak, Distributed optimization in sensor networks, in: Third International Symposium on Information Processing in Sensor Networks (IPSN 04), 2004, pp. 20–27.

[6] V. Vapnik, An overview of statistical learning theory, IEEE Trans. Neural Networks 10 (1999) 988–999.
[7] A. Navia-Vazquez, D. Gutierrez-Gonzalez, E. Parrado-Hernandez, J. Navarro-Abellan, Distributed support vector machines, IEEE Trans. Neural Networks 17 (2006) 1091–1097.
[8] Y. Lu, V. Roychowdhury, L. Vandenberghe, Distributed parallel support vector machines in strongly connected networks, IEEE Trans. Neural Networks 19 (2008) 1167–1178.
[9] H. Ang, V. Gopalkrishnan, S. Hoi, W. Ng, Cascade RSVM in peer-to-peer networks, Mach. Learn. Knowl. Discovery Databases (2008) 55–70.
[10] W. Kowalczyk, N. Vlassis, Newscast EM, in: NIPS 17, MIT Press, 2005, pp. 713–720.
[11] E. Kokiopoulou, P. Frossard, Graph-based classification of multiple observation sets, Pattern Recognition 43 (2010) 3988–3997.
[12] F. Benezit, P. Thiran, M. Vetterli, Interval consensus: from quantized gossip to voting, in: IEEE ICASSP 2009, pp. 3661–3664.
[13] F. Benezit, Distributed Average Consensus for Wireless Sensor Networks, Ph.D. Thesis, Information and Communications Sciences, Lausanne, 2009.
[14] C.M. Bishop, Neural Networks for Pattern Recognition, 1st ed., Oxford University Press, USA, 1996.
[15] D. Zhou, O. Bousquet, T.N. Lal, J. Weston, B. Schölkopf, Learning with local and global consistency, in: Advances in Neural Information Processing Systems, vol. 16, MIT Press, 2004, pp. 321–328.
[16] G. Cormack, T. Lynam, Spam corpus creation for TREC, Stanford University.

Leonidas Georgopoulos received his B.Sc. in Physics from the National Kapodistrian University of Athens (University of Athens) in 2004, his M.Sc. in Artificial Intelligence from The Edinburgh University in 2005, and he received his Ph.D. in Sciences from the School of Computer, Communication, and Information Sciences of the Swiss Federal Institute of Technology Lausanne (EPFL) in 2011. He has received in the past the Outstanding M.Sc. Project prize from the University of Edinburgh, and the Scholarship for Foreign Students in Arts and Sciences from the Swiss Confederation, and the Student Paper Award in the International Symposium for Nonlinear Theory and Applications. His current research interests are within the field of distributed information processing in complex networks. Particularly, his current research is related to the application of machine learning algorithms distributively in ad-hoc networks, grid computing, and cloud computing.

Martin Hasler received the Diploma in 1969 and the Ph.D. degree in 1973 from the Swiss Federal Institute of Technology, Zurich, both in physics. He continued research in mathematical physics at Bedford College, University of London, from 1973 to 1974. At the end of 1974 he joined the Circuits and Systems group of the Swiss Federal Institute of Technology Lausanne (EPFL), and later headed the Nonlinear Systems Laboratory as an associate and full professor. In 2002, he was acting Dean of the newly created School of Computer and Communication Sciences of EPFL. Since March 2011 he is honorary professor (emeritus) of EPFL. In the recent past (and to some extent still today) his research interests were (are) centred in nonlinear dynamics and information processing, both in engineering and in biological systems. In particular, he was interested in the engineering applications of complicated nonlinear dynamics, especially chaos. This also includes the modelling and identification of nonlinear circuits and systems. In particular, he was (is) concentrating his research effort on the qualitative behaviour and modelling of complex dynamical networks, be they of biological or technical nature. The study of synchronisation phenomena is part of this effort. He is a Fellow of the IEEE. He was Associate Editor and then Editor-in-Chief of the IEEE Transactions on Circuits and Systems. He was member of the Board of Governors and Vice-President for Technical Activities of the IEEE CAS Society. He was a member of the Scientific Council of the Swiss National Science Foundation.