Improving Distributed Stochastic Gradient Descent Estimate via Loss Function Approximation




16th IFAC Workshop on Control Applications of Optimization
October 6-9, 2015. Garmisch-Partenkirchen, Germany

Available online at www.sciencedirect.com

IFAC-PapersOnLine 48-25 (2015) 292–297


Alexander A. Senov*

* Saint-Petersburg State University, Saint-Petersburg, Russia (e-mail: [email protected]).

Abstract: Both learning and optimization problems have become very important nowadays in the context of huge amounts of data. Unfortunately, even online and parallel optimization methods may fail to handle such amounts of data and fit into time limits. In this case distributed optimization methods may be the only solution. In this paper we consider a particular type of optimization problem in a distributed setting. We propose an algorithm substantially based on the distributed stochastic gradient descent method proposed in Zinkevich et al. (2010). Finally, we experimentally study properties of the proposed algorithm and demonstrate its superiority for a particular type of optimization problem.

© 2015, IFAC (International Federation of Automatic Control) Hosting by Elsevier Ltd. All rights reserved.
Keywords: Large Scale Optimization Problems; Stochastic Optimization; Parameter Estimation; Learning Algorithms; Parallel Algorithms; Steepest Descent; Least-Squares Estimation

Both machine learning and optimization have become very popular and widely used in today's control applications (see, for example, Boyd (2012); Granichin et al. (2014)). It is well known that most machine learning problems can be translated into the optimization of a real-valued convex parametrized function L(ω), known as the loss function (Bottou (1998)). Moreover, in most cases such functions are separable (Boyd et al. (2011)), i.e. L(ω) = Σ_{i=1}^N ℓ_i(ω). Thus, optimization of convex separable functions has become a rather important problem in the context of control.

The convex function optimization problem is very well studied (see Boyd and Vandenberghe (2004) for an extensive overview). Assuming convexity and differentiability of the function, one can choose among many first-order iterative optimization algorithms, e.g. the gradient descent algorithm, also known as steepest descent, originally proposed in Cauchy (1847). The general idea of first-order iterative optimization algorithms is to gradually improve the current estimate of the function argmin by moving it in the direction opposite to the function gradient. These algorithms calculate the gradient of the entire function L, so the computational complexity of each iteration is at least O(N). Since they may need a considerable number of iterations to achieve the desired accuracy, they become inapplicable for large N. Such algorithms are often referred to as batch methods because they process the entire batch {ℓ_i}_{i=1}^N at each iteration. Recent developments in randomized optimization algorithms may reduce the computational complexity by special randomization and recurrent estimation techniques (Granichin and Polyak (2003)), but this does not change the situation radically.

For large N, so-called online optimization algorithms gain especial popularity. These algorithms consider only a small fixed number of ℓ_i functions at every iteration, thus reducing the iteration complexity to O(1) in terms of N. Despite the fact that they usually need more iterations than batch methods, the overall complexity often turns out to be lower. One of the most popular online optimization algorithms is stochastic gradient descent, proposed in Davidon (1976) and further extensively developed, especially for learning purposes (Cun et al. (2004); Bottou (1998)).

Occasionally, even online algorithms might fail because of tight time limits. To accelerate them we can use parallel computation and corresponding parallel optimization algorithms. Such parallel algorithms (see, for example, Boyd et al. (2011); Bertsekas (1997)) efficiently utilize shared memory and tight process coupling to communicate between processes intensively, thus ensuring great speedup.

Though parallel algorithms provide great speedup on a single machine, even they may fail if the amount of data exceeds the size of the whole machine memory and it is impossible to scale single-machine performance any further. Fortunately, the latest hardware and software developments make it possible to design distributed algorithms that work on multiple computing nodes with wired or wireless connections between them. The only limits of such a computational environment are network latency, budget and the lack of theoretical approaches. Indeed, despite the fact that hardware and software have developed significantly in this area, not much work has been done in the area of distributed optimization, especially in the context of such computational models as MapReduce, which allows only one inter-node communication phase at the end of the computation. Some recent developments in this area include Chu et al. (2007), where the authors describe a distributed (multi-core in their notation) realization of several machine learning algorithms (and implicitly of several first-order optimization algorithms) in the MapReduce computational model. In Mann et al. (2009) the authors solve the distributed optimization problem by a rather simple but inventive strategy: they solve the optimization problem independently on each node and, in the end, obtain the joint solution by simply averaging the solutions from every node.


This method is even more attractive because it perfectly fits into the MapReduce model, using only one MapReduce iteration. This solution was further developed in Zinkevich et al. (2010), where the authors use stochastic gradient descent on every node with only a subset of {ℓ_i}_1^N and provide a strong theoretical analysis of convergence for both the original and the distributed versions of stochastic gradient descent.

Despite the great theoretical and experimental analysis in Chu et al. (2007), Mann et al. (2009) and Zinkevich et al. (2010), all these papers exploit the same idea for distributed optimization: first run independent optimization processes on every node, then collect and average the obtained parameter estimates. In this paper we replace this last averaging phase with a slightly more complex but more efficient strategy: we build an approximation L̃ of the function L using data additionally collected at each node during the independent optimization process, and take argmin L̃ as the final parameter estimate.

The paper is organized as follows. In section 1 we cover some basics and previous results required for further explanation. Then in section 2 we formulate the particular distributed optimization problem statement. In section 3 we propose an algorithm aimed to solve the proposed problem. Finally, in section 4 we illustrate some of its properties and perform a comparative analysis with the algorithm proposed in Zinkevich et al. (2010) on modelled data.

1. PRELIMINARIES

In this section we briefly introduce some methods and concepts necessary for further explanation.

1.1 Notation remarks

We use small light symbols x for scalars and indexes (except the parameter vector ω), small bold symbols x for vectors, capital light symbols X for constants and sets (except the parameter matrix Θ), and capital bold symbols X for matrices. Specifically, we denote by K the number of nodes, by T the number of iterations, by λ the iterative optimization step size, and by B the size of a mini-batch.

1.2 Separable stochastic quadratic function

This paper concentrates on a particular type of function: a separable stochastic quadratic real-valued function L parametrized by ω ∈ R^d. By the term separable we mean that L can be represented as follows:

L(ω) = (1/N) Σ_{i=1}^N ℓ_i(ω).    (1)

By the term quadratic we mean that every ℓ_i is quadratic in its argument vector ω (and so is L). By the term stochastic we mean that every function ℓ_i is perturbed by noise of some stochastic nature, i.e. ℓ_i(ω) = ℓ(ω) + ε_i(ω), where ℓ(ω) is some standard quadratic function and ε_i(ω) is some zero-mean, finite-variance random function (i.e. a random variable somehow dependent on ω).

This choice of a particular loss function type may seem unobvious. However, it can be easily explained.


Consider the linear regression problem (described in section 1.5) with input vectors x_i, output scalars y_i and unknown parameter ω. The prediction-error loss of this model equals

ℓ_i^lr(ω) = (y_i − x_i^T ω)^2 = ω^T x_i x_i^T ω − 2 y_i x_i^T ω + y_i^2 = ω̇^T Z_i ω̇,

where ω̇ denotes ω augmented with a constant component and Z_i collects the coefficients x_i x_i^T, −y_i x_i and y_i^2. This clearly has the same form as ℓ_i: ℓ_i^lr(ω) = ℓ^lr(ω) + ε_i^lr(ω), where

ℓ^lr(ω) = ω̇^T E_{x,ε}[Z] ω̇,    ε_i^lr(ω) = ω̇^T (Z_i − E_{x,ε}[Z]) ω̇.

1.3 Stochastic gradient descent

Stochastic gradient descent (SGD) has become a very popular optimization technique in the past decades, especially in the machine learning and control areas, by virtue of both its simplicity and its effectiveness. It was originally proposed in Davidon (1976) for least squares estimation and further developed particularly for online learning purposes (Cun et al. (2004); Bottou (1998)). SGD aims to solve the optimization problem for stochastic separable functions in the following way. Assuming convexity of ℓ_i and using

min_ω ℓ = E min_ω ℓ_i,

at each iteration we randomly choose one ℓ_j among {ℓ_i}_1^N and shift the current parameter (argmin) estimate opposite to the gradient direction ∂_ω ℓ_j. The stochastic gradient descent pseudo code is listed in Algorithm 1.

Algorithm 1 SGD({ℓ_1, ..., ℓ_N}, T, λ, ω_0)
  for t ← 1 to T do
    Draw j ∈ {1, ..., N} uniformly at random
    ω_t ← ω_{t−1} − λ ∂_ω ℓ_j(ω_{t−1})
  end for
  return ω_T

The mini-batch stochastic gradient descent (mbSGD) algorithm uses the same strategy, but at each iteration it randomly chooses a subset J of size B of the functions {ℓ_i}_1^N and averages the gradients of the functions from J. Its pseudo code is presented in Algorithm 2.

Algorithm 2 mbSGD({ℓ_1, ..., ℓ_N}, T, λ, ω_0, B)
  for t ← 1 to T do
    Draw J ⊂ {1, ..., N} of size |J| = B uniformly at random
    ω_t ← ω_{t−1} − λ (1/B) Σ_{j∈J} ∂_ω ℓ_j(ω_{t−1})
  end for
  return ω_T

1.4 Distributed stochastic gradient descent

In this subsection we rephrase the algorithm proposed in Dekel et al. (2012), on which our work is substantially based. Its pseudo code is listed in Algorithm 3. It is worth noting that in Zinkevich et al. (2010) the authors assume that the dataset is so large that we can choose an arbitrary N to provide enough observations for T iterations on K machines with batch size B. It means that we can first specify the number of iterations T, the batch size B and the number of machines K, and then choose an appropriate N to meet the condition T = N/K/B. We will stick to this assumption later on.
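For example (illustrative numbers, not taken from the paper): with K = 4 nodes, batch size B = 10 and T = 50 iterations per node, this condition requires N = T · K · B = 50 · 4 · 10 = 2000 observations.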



Algorithm 3 SimuParallel SGD({ℓ_1, ..., ℓ_N}, λ, B, K, T)
 1: Ensure that T = N/K/B
 2: Randomly partition the examples on K samples I_k of size |I_k| = T
 3: for k ← 1 to K do in parallel
 4:   Randomly shuffle sample I_k
 5:   Randomly pick ω_{k,0} from the ℓ domain
 6:   for t ← 1 to T do
 7:     Draw J ⊂ I_k of size |J| = B uniformly at random
 8:     ℓ̄(ω_{k,t−1}) ← (1/B) Σ_{j∈J} ℓ_j(ω_{k,t−1})
 9:     ω_{k,t} ← ω_{k,t−1} − λ ∂_ω ℓ̄(ω_{k,t−1})
10:   end for
11: end for in parallel
12: ω̂ ← (1/K) Σ_{k=1}^K ω_{k,T}
13: return ω̂
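To make the pseudo code above concrete, here is a minimal Python/NumPy sketch (ours, not the authors' code) of mbSGD from Algorithm 2 applied to the quadratic losses ℓ_i(ω) = (y_i − x_i^T ω)^2 used throughout this paper, together with the SimuParallel scheme of Algorithm 3. In a real deployment the per-node loop would run on K separate machines, e.g. as the map phase of a MapReduce job, with the final averaging as the reduce phase; the function names, the uniform initialisation range and sampling without replacement are our own illustrative choices.

import numpy as np

def mbsgd(X, y, T, lam, omega0, B, rng):
    """Mini-batch SGD (Algorithm 2) for the squared losses l_i(w) = (y_i - x_i^T w)^2."""
    omega = omega0.copy()
    N = X.shape[0]
    for _ in range(T):
        J = rng.choice(N, size=B, replace=False)       # draw a mini-batch uniformly at random
        XJ, yJ = X[J], y[J]
        grad = 2.0 / B * XJ.T @ (XJ @ omega - yJ)      # gradient of the averaged mini-batch loss
        omega = omega - lam * grad
    return omega

def simuparallel_sgd(X, y, lam, B, K, T, rng):
    """SimuParallel SGD (Algorithm 3): independent mbSGD runs on K partitions, then averaging."""
    N, d = X.shape
    assert T == N // K // B, "expects T = N/K/B"
    parts = np.array_split(rng.permutation(N), K)      # random partition of the examples
    estimates = []
    for Ik in parts:                                   # "map" phase: one mbSGD run per node
        omega0 = rng.uniform(-2.5, 2.5, size=d)        # random initial point (illustrative choice)
        estimates.append(mbsgd(X[Ik], y[Ik], T, lam, omega0, B, rng))
    return np.mean(estimates, axis=0)                  # "reduce" phase: average the K estimates

Here rng is a numpy.random.Generator, e.g. np.random.default_rng(0).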

1.5 Weighted Least Squares

Least squares (LS) is a technique used for parameter estimation in the following input-output linear model:

y_i = x_i^T ω + ε_i,  i = 1..N,    (2)

where x_i ∈ R^d is an input vector, y_i is an output scalar, ε_i is an unknown scalar random noise and ω is the unknown parameter which we want to find. Hereinafter we omit the intercept term on the supposition of its presence in x_i, i.e. x_i = (1, x_i^(2), ..., x_i^(d))^T.

Denote X = [x_1, ..., x_N]^T and y = (y_1, ..., y_N)^T. Then, under the assumptions E ε_i = 0, Var ε_i = σ^2, E ε_i ε_j = 0 ∀ i ≠ j, the least squares estimate is

ω̂_LS = (X^T X)^{−1} X^T y    (3)

and it has the lowest estimate variance among all linear unbiased estimators (BLUE). However, if the assumption Var ε_i = σ^2 (homoscedasticity) is not met, LS loses this important property. This becomes particularly important when we face heteroscedasticity, Var ε_i = σ_i^2: errors have a different variance for each observation. To overcome this obstacle, one can use the weighted least squares method (Aitken (1934)). Weighted least squares corrects for heteroscedasticity (see White (1980)) using a special weight matrix W = diag(w_1, ..., w_N), where w_i ∼ 1/σ_i^2, yielding the following weighted least squares estimate:

ω̂_WLS = (X^T W X)^{−1} X^T W y.    (4)

It is easy to see that weighted least squares is equal to standard least squares with the observations y_i, x_i additionally multiplied by the corresponding weight w_i. By this multiplication we change the original model (2) to

w_i y_i = w_i x_i^T ω + w_i ε_i,  i = 1..N,    (5)

and since Var(w_i ε_i) = const ∀ i, this new model is obviously homoscedastic.
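For reference, the estimates (3) and (4) can be computed directly; a minimal NumPy sketch (ours, not from the paper):

import numpy as np

def ls_estimate(X, y):
    """Ordinary least squares: (X^T X)^{-1} X^T y."""
    return np.linalg.solve(X.T @ X, X.T @ y)

def wls_estimate(X, y, w):
    """Weighted least squares with weights w_i ~ 1/sigma_i^2: (X^T W X)^{-1} X^T W y, W = diag(w)."""
    XtW = X.T * w                      # scales the i-th column of X^T by w_i, i.e. X^T W
    return np.linalg.solve(XtW @ X, XtW @ y)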

1.6 Paper scope

In this paper we focus solely on the problem of producing a better single parameter vector estimate ω̂ out of the per-node estimates {ω̂_k}_{k=1}^K and some auxiliary information collected during the execution of the SimuParallel mbSGD algorithm. We leave any other aspects of the optimization process (e.g. the optimal choice of the learning rate λ, the batch size B, the number of cores K, etc.) out of the scope of this paper. Despite the fact that some of these aspects are explicitly touched upon during the experiments in section 4, we insist that this work should not be perceived as covering those aspects in any way. We can only refer the reader to other works that cover these aspects.

2. PROBLEM STATEMENT

Now we are able to formulate the detailed problem statement. First, we formulate our assumptions.

Assumption 1. (Theoretical assumptions)
• The function L is separable (1): L = Σ_{i=1}^N ℓ_i.
• The function ℓ_i is stochastic ∀ i, i.e. ℓ_i(ω) = ℓ(ω) + ε_i(ω), where ε_i(ω) is an i.i.d. random function (noise) with E ε_i(ω) = 0 ∀ i = 1..N and a second moment that is bounded and dependent on ω.
• The function ℓ is convex (hence it may be approximated by a second-order polynomial in a neighbourhood of its argmin).

Due to the amount of data, time requirements, and software and hardware restrictions, only algorithms with the following properties are considered feasible.

Assumption 2. (Computational assumptions)
• The algorithm should be distributed, i.e. run on a given number of nodes K in a parallel manner.
• The algorithm should be online, i.e. use only O(1) (in terms of N) loss functions ℓ_i at each iteration.
• The algorithm should use only a single communication step between the K nodes, at the end of the computation.

Problem statement. Find the minimum of the function L (1) under Assumptions 1 and 2.

3. ALGORITHMS DESCRIPTION

In this section we present two algorithms aimed at solving the problem posed in section 2. Actually, the SimuParallelmbSGD algorithm is aimed at solving exactly this problem. However, its final step, which averages the parameter estimates from every node, may be significantly improved in terms of accuracy. In the following subsection we propose an algorithm based on loss function reconstruction which significantly improves the estimate of SimuParallelmbSGD.

3.1 WLSDSGD algorithm

WLSDSGD stands for Weighted Least Squares loss reconstruction based Distributed mini-batch Stochastic Gradient Descent. Its main idea, and at the same time the main difference between SimuParallelmbSGD and WLSDSGD, is that the latter replaces the averaging over the argmin estimates from each node used in SimuParallelmbSGD by gathering some data about the loss function from each node, constructing an approximation ℓ̃ of ℓ, and using argmin ℓ̃ as the final parameter estimate. More precisely, the main idea of the WLSDSGD algorithm is the following.

(1) In addition to the final-iteration parameter estimate ω_{k,T}, each node k also calculates the loss function sample mean ℓ̄(ω_{k,t−1}) and sample variance V̂ar ℓ(ω_{k,t−1}) at every iteration t, at the parameter estimate ω_{k,t−1}.


(2) At the final step, instead of averaging the estimates from each node, a more complex but more accurate strategy is used:
  (a) a loss function approximation ℓ̃ is built via the weighted least squares algorithm based on the collected data, using {ω_{k,t−1}} as the inputs {x_i}, {ℓ̄(ω_{k,t−1})} as the outputs {y_i} and {V̂ar ℓ(ω_{k,t−1})} as the inverse weights {w_i^{−1}}, in accordance with the notation of section 1.5;
  (b) ω̃ = argmin ℓ̃ is used as the final estimate.

Fig. 1. WLSDSGD algorithm scheme: the input data L_N(ω) = (1/N) Σ_{i=1}^N ℓ_i(ω) are split and distributed over K nodes, with node k receiving the subset {ℓ_i}_{i∈I_k} (I_1 = {π(1), ..., π(N/K)}, ..., I_K = {π(N − N/K), ..., π(N)}); each node applies mbSGD({ℓ_i}_{I_k}, ω_0, λ, B, T) and produces the results {ω_{k,i}, ℓ̄(ω_{k,i}), V̂ar ℓ(ω_{k,i})}_{i=0}^{T−1}; these results are gathered on a terminal node into S_{K,T} = {ω_{k,i}, ℓ̄(ω_{k,i}), V̂ar ℓ(ω_{k,i})}_{k=1,i=0}^{K,T−1}; the approximation ℓ̃ ← Poly2WLS(S_{K,T}) is built as a second-order polynomial and ω̃ ← argmin_ω ℓ̃ is returned.

Algorithm 4 WLSDSGD({ℓ_1, ..., ℓ_N}, λ, B, K, T)
 1: Ensure that T = N/K/B
 2: Randomly partition the examples on K samples I_k of size |I_k| = T B
 3: for k ← 1 to K do in parallel
 4:   Randomly shuffle sample I_k
 5:   Randomly pick ω_{k,0} from the ℓ domain
 6:   for t ← 1 to T do
 7:     Draw J ⊂ I_k of size |J| = B uniformly at random
 8:     ℓ̄(ω_{k,t−1}) ← (1/B) Σ_{j∈J} ℓ_j(ω_{k,t−1})
 9:     V̂ar ℓ(ω_{k,t−1}) ← (1/B) Σ_{j∈J} (ℓ_j(ω_{k,t−1}) − ℓ̄(ω_{k,t−1}))^2
10:     ω_{k,t} ← ω_{k,t−1} − λ ∂_ω ℓ̄(ω_{k,t−1})
11:   end for
12: end for in parallel
13: S_ℓ^{K,T} ← {ℓ̄(ω_{k,t})}_{k=1,t=0}^{K,T−1}
14: S_ω^{K,T} ← {ω_{k,t}}_{k=1,t=0}^{K,T−1}
15: S_{Var[ℓ]}^{K,T} ← {V̂ar ℓ(ω_{k,t−1})}_{k=1,t=0}^{K,T−1}
16: ω̂ ← argmin_ω Poly2WLS(S_ℓ^{K,T}, S_ω^{K,T}, S_{Var[ℓ]}^{K,T})
17: return ω̂
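As an illustration of lines 13-16 of Algorithm 4 (and of step (2) above), here is a minimal Python/NumPy sketch of how a Poly2WLS-style fit can be implemented; this is our own reading, not the authors' code. The visited parameter points are mapped to second-order polynomial features, the coefficients of ℓ̃ are fitted by weighted least squares with weights inversely proportional to the sample variances, and the minimiser of the fitted quadratic is returned; the small ridge term is our addition to guard against degenerate fits.

import numpy as np

def poly2wls_argmin(omegas, loss_means, loss_vars, ridge=1e-8):
    """Fit l~(w) = w^T A w + b^T w + c to (omega, mean loss) pairs by weighted least
    squares with weights 1/variance and return the argmin of the fitted quadratic."""
    omegas = np.asarray(omegas, dtype=float)                 # (n, d) visited parameter points
    n, d = omegas.shape
    iu = np.triu_indices(d)                                  # upper-triangular (i, j) index pairs
    quad = np.stack([omegas[:, i] * omegas[:, j] for i, j in zip(*iu)], axis=1)
    F = np.hstack([np.ones((n, 1)), omegas, quad])           # features: constant, linear, quadratic
    w = 1.0 / (np.asarray(loss_vars, dtype=float) + ridge)   # inverse-variance weights
    FtW = F.T * w
    coef = np.linalg.solve(FtW @ F + ridge * np.eye(F.shape[1]), FtW @ np.asarray(loss_means))
    b = coef[1:1 + d]
    A = np.zeros((d, d))
    A[iu] = coef[1 + d:]
    A = (A + A.T) / 2.0                                      # symmetric part defines the quadratic form
    return np.linalg.solve(2.0 * A, -b)                      # stationary point: 2 A w + b = 0 (assumes A > 0)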

The general overview of the WLSDSGD algorithm is visualized in Fig. 1 above and its pseudo code is presented in Algorithm 4.

4. MODELLING

In this section we describe the modelling strategy, aimed both to demonstrate the dependency of WLSDSGD on its parameters and to compare it with the SimuParallelmbSGD algorithm.

Modelling strategy:
(1) First, we fix a particular type of ℓ_i function.
(2) Then, we fix default values of the parameters B, K, T and the ω_{k,0} random picking strategy for the algorithms SimuParallelmbSGD and WLSDSGD.
(3) We independently vary each of the parameters above, one after another, with all other parameters fixed.
(4) For each varied parameter value (with the other parameters at their default values) we execute each algorithm multiple times (1000), compute the error ||ω̃ − argmin ℓ||_2^2 for each run and calculate the 0.25, 0.5 and 0.75 quantiles of these errors. Next, we plot these quantiles against the varied parameter values, on a separate figure for each parameter.
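A minimal sketch of the evaluation loop just described (illustrative; run_algorithm stands for one complete end-to-end run of either SimuParallelmbSGD or WLSDSGD with the current parameter setting and freshly generated data, omega_star for the known true parameter):

import numpy as np

def error_quantiles(run_algorithm, omega_star, n_runs=1000):
    """Run the algorithm n_runs times and return the 0.25/0.5/0.75 quantiles
    of the squared estimation error ||omega_hat - omega_star||_2^2."""
    errors = np.empty(n_runs)
    for r in range(n_runs):
        omega_hat = run_algorithm()                    # one full run with fixed parameter values
        errors[r] = np.sum((omega_hat - omega_star) ** 2)
    return np.quantile(errors, [0.25, 0.5, 0.75])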


Using this strategy we produce a plot for each varied parameter. In such a way we demonstrate both each algorithm's accuracy as a function of the varied parameter and their accuracy relative to each other. In the following subsections we describe all of the above points in detail and discuss the plots resulting from this scheme.

4.1 Modelling the ℓ_i function

Here we describe the concrete type of functions ℓ_i used for modelling (these functions are loss functions for the linear regression problem):

ℓ_i(ω) = (y_i − x_i^T ω)^2 = ω^T x_i x_i^T ω − 2 y_i x_i^T ω,
x_i ∼ N(m_x, Σ_x),  y_i = x_i^T ω* + ε_i,  E ε_i = 0,  Var ε_i = σ_ε^2,

where x_i are independent variables, y_i are dependent variables and ε_i are error terms such that E ε_i ε_j = σ_ε^2 δ_ij and E x_i ε_j = 0 ∀ i, j. Furthermore, we choose the following specific values:

m_x = (2, −1)^T,  Σ_x = [ [2, 0.5], [0.5, 1] ],  ω* = (1, 0.5)^T.

Clearly, this is a particular type of ℓ_i function as described in Assumption 1. It is known that argmin_ω ℓ = ω*, the true parameter value we want to find.
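For illustration, such data can be generated as follows (our sketch; the noise level sigma_eps is an assumed value, since a concrete σ_ε^2 is not stated above):

import numpy as np

def generate_regression_data(N, rng, sigma_eps=0.5):
    """Generate the (x_i, y_i) pairs of section 4.1: x_i ~ N(m_x, Sigma_x), y_i = x_i^T omega* + eps_i."""
    m_x = np.array([2.0, -1.0])
    Sigma_x = np.array([[2.0, 0.5], [0.5, 1.0]])
    omega_star = np.array([1.0, 0.5])
    X = rng.multivariate_normal(m_x, Sigma_x, size=N)
    eps = rng.normal(0.0, sigma_eps, size=N)           # sigma_eps is an assumed noise level
    y = X @ omega_star + eps
    return X, y, omega_star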

4.2 Default parameter values

We use the following default values for the parameters λ, B, K and T:

λ = 0.01,  B = 10,  K = 4,  T = 50.    (6)

The ω_{k,0} random picking strategy is the following. Recall that the true parameter value is ω* = (1, −2)^T. We use a uniform distribution over a 5×5 square centered at ω_sq = (1, −2)^T for random picking. In other words, we uniformly pick a value in the segment [ω_sq^(j) − 2.5, ω_sq^(j) + 2.5] as ω_{k,0}^(j) ∀ k = 1..K, j = 1..2.
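A short sketch of this picking rule (illustrative):

import numpy as np

def pick_omega0(omega_sq, rng, half_width=2.5):
    """Pick each coordinate of omega_{k,0} uniformly in [omega_sq_j - 2.5, omega_sq_j + 2.5]."""
    omega_sq = np.asarray(omega_sq, dtype=float)
    return omega_sq + rng.uniform(-half_width, half_width, size=omega_sq.shape)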


4.3 Varying parameters


We vary the value of each parameter B, K, T and the random picking strategy of ω_{k,0} independently, with the other parameters fixed, in the following way:

B ∈ {2, 4, 8, 16, 32, 64},
K ∈ {2, 3, 4, 5, 8, 16, 32},
T ∈ {10, 20, 30, 50, 75, 100, 150},    (7)
ω_sq ∈ {ω*, ω* + 0.5, ω* + 1, ω* + 2, ω* + 4, ω* + 8, ω* + 16}.

4.4 Modelling results discussion


In this section we describe the modelling results. Before that, we make a few remarks on accuracy estimation and the specifics of the figures. By the error of an algorithm we understand the norm ||ω* − ω̂||_2^2, where ω̂ is the algorithm's parameter estimate. As mentioned above, for each algorithm and for each set of parameter values we compute this error 1000 times and then calculate the 0.25, 0.5 and 0.75 quantiles of these errors. Having the quantile values, we plot them as error bars on the respective figures. Thus, for every parameter value and every algorithm we plot the 0.5 error quantile as a point with lower (0.25 quantile) and upper (0.75 quantile) whiskers. Then we connect the 0.5 quantiles for different parameter values with a line. A legend is presented in each figure, where the algorithms SimuParallelmbSGD and WLSDSGD are designated as AVG and WLS, respectively.

First we vary the number of iterations T. The algorithms' errors depending on the value of T are illustrated in figure 2. It is clear that WLSDSGD outperforms the SimuParallelmbSGD algorithm starting from at most 20 iterations. This is due to the fact that the amount of data used for the ℓ̃ approximation in the WLSDSGD algorithm depends linearly on T. Hence, when T is too small, the amount of data is not enough to obtain a stable ℓ̃ approximation.

Fig. 2. Methods' estimation error quantiles with varying number of iterations T

Now let us take a look at figure 3. The situation here is quite similar: WLSDSGD performs better than LSDSGD, which in turn is more accurate than the SimuParallelmbSGD algorithm, starting from about 4 nodes, since the amount of data used for the approximation in WLSDSGD depends linearly on K as well.

Next, consider figure 4, where we can see the algorithms' errors depending on the value of B. It may be noticed that WLSDSGD performs much worse than SimuParallelmbSGD when B is small. This is due to the fact that the variance of the corresponding weight estimates is proportional to B^{−1}. However, starting from reasonable values of B, WLSDSGD surpasses the SimuParallelmbSGD algorithm again.

Finally, consider figure 5. Despite the fact that the accuracy of all three algorithms decreases dramatically as the ω shift increases, WLSDSGD still outperforms the SimuParallelmbSGD algorithm (Algorithm 3).

Fig. 3. Methods' estimation error quantiles with varying number of nodes K


Fig. 4. Methods' estimation error quantiles with varying mini-batch size B

Fig. 5. Methods' estimation error quantiles with varying omega shift ω_sq − ω*

5. CONCLUSION

We proposed the WLSDSGD algorithm for distributed optimization of a separable stochastic convex function via its reconstruction. The algorithm is essentially based on the SimuParallelmbSGD algorithm proposed in Zinkevich et al. (2010) and fits perfectly into such distributed computation frameworks as MapReduce. We demonstrated the performance of the proposed algorithm in a comparative manner and showed that WLSDSGD outperforms SimuParallelmbSGD for most parameter settings, especially when the amount of data collected by WLSDSGD is sufficient to approximate the loss function accurately.

REFERENCES

Aitken, A.C. (1934). On least-squares and linear combinations of observations. In Proceedings of the Royal Society of Edinburgh, volume 55, 42–48.
Bertsekas, D.P. (1997). Parallel and distributed computation: numerical methods. Athena Scientific.
Bottou, L. (1998). Online learning and stochastic approximations. On-line learning in neural networks, 17(9), 142.
Boyd, S. (2012). Global optimization in control system analysis and design. Control and Dynamic Systems V53: High Performance Systems Techniques and Applications: Advances in Theory and Applications, 53, 1.
Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J. (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1), 1–122.


Boyd, S. and Vandenberghe, L. (2004). Convex optimization. Cambridge University Press.
Cauchy, A. (1847). Méthode générale pour la résolution des systèmes d'équations simultanées. Comp. Rend. Sci. Paris, 25(1847), 536–538.
Chu, C., Kim, S.K., Lin, Y.A., Yu, Y., Bradski, G., Ng, A.Y., and Olukotun, K. (2007). Map-reduce for machine learning on multicore. Advances in Neural Information Processing Systems, 19, 281.
Cun, L., Yann, L.B., and Bottou, L. (2004). Large scale online learning. Advances in Neural Information Processing Systems, 16, 217.
Davidon, W.C. (1976). New least-square algorithms. Journal of Optimization Theory and Applications, 18(2), 187–197.
Dekel, O., Gilad-Bachrach, R., Shamir, O., and Xiao, L. (2012). Optimal distributed online prediction using mini-batches. The Journal of Machine Learning Research, 13(1), 165–202.
Granichin, O., Volkovich, V., and Toledano-Kitai, D. (2014). Randomized algorithms in automatic control and data mining. Springer.
Granichin, O. and Polyak, B. (2003). Randomized algorithms of estimation and optimization under almost arbitrary noise. M.: Nauka.
Mann, G., McDonald, R., Mohri, M., Silberman, N., and Walker, D. (2009). Efficient large-scale distributed training of conditional maximum entropy models. In Advances in Neural Information Processing Systems, volume 22, 1231–1239.
White, H. (1980). A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica: Journal of the Econometric Society, 817–838.
Zinkevich, M., Weimer, M., Li, L., and Smola, A.J. (2010). Parallelized stochastic gradient descent. In Advances in Neural Information Processing Systems, 2595–2603.