A new accelerated proximal gradient technique for regularized multitask learning framework

Accepted Manuscript

Mridula Verma, K.K. Shukla

PII: S0167-8655(17)30220-9
DOI: 10.1016/j.patrec.2017.06.013
Reference: PATREC 6852

To appear in: Pattern Recognition Letters

Received date: 20 October 2016
Revised date: 27 March 2017
Accepted date: 15 June 2017

Please cite this article as: Mridula Verma, K.K. Shukla, A New Accelerated Proximal Gradient Technique for Regularized Multitask Learning Framework, Pattern Recognition Letters (2017), doi: 10.1016/j.patrec.2017.06.013

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Highlights

• A new accelerated gradient method for the regularized multitask learning framework.
• This is the first time that the combination of the extra-gradient step and the inertial term is analyzed for the multitask learning problem.
• Convergence and stability of the algorithm have been proved under specified conditions.


• The algorithm outperforms earlier methods in terms of empirical convergence rate, standard accuracy measures and computational time.


• Experiments are conducted on three real multitask regression and two multitask classification datasets.



A New Accelerated Proximal Gradient Technique for Regularized Multitask Learning Framework

Mridula Verma∗∗, K. K. Shukla
Dept. of Computer Science and Engineering, Indian Institute of Technology (BHU), Varanasi-221005, India

ABSTRACT

Multitask learning can be defined as the joint learning of related tasks using shared representations, such that each task can help the other tasks to perform better. One of the various multitask learning frameworks is the regularized convex minimization problem, for which many optimization techniques are available in the literature. In this paper, we consider solving the non-smooth convex minimization problem with sparsity-inducing regularizers for the multitask learning framework, which can be efficiently solved using proximal algorithms. Due to the slow convergence of traditional proximal gradient methods, a recent trend is to introduce acceleration into these methods, which increases the speed of convergence. We present a new accelerated gradient method for the multitask regression framework, which not only outperforms its non-accelerated counterpart and the traditional accelerated proximal gradient method, but also improves the prediction accuracy. We also prove the convergence and stability of the algorithm under a few specific conditions. To demonstrate the applicability of our method, we performed experiments with several real multitask learning benchmark datasets. Empirical results show that our method outperforms the previous methods in terms of convergence, accuracy and computational time.

© 2017 Elsevier Ltd. All rights reserved.


1. Introduction


Multitask learning is a field of machine learning in which the primary learning task involves multiple related subtasks of learning classification or regression models, and this learning is performed jointly. One approach is to consider each subtask individually under the single task learning framework and to combine the results from the individual subtasks after learning; however, this approach does not exploit the relatedness of the subtasks while learning. In multitask learning, we utilize the internal relatedness between tasks during the learning process so that the performance on the individual tasks can be enhanced. Various real-world applications of such algorithms include recognition (Ando and Zhang, 2005), recommender systems (Argyriou et al., 2008; Li et al., 2009), natural language processing (Daumé et al., 2010; Collobert and Weston, 2008), computational biology (Bickel et al., 2008; Zhou et al., 2011), web search ranking (Chapelle et al., 2010), etc.

∗∗ Corresponding author. Tel.: +91-955-473-3622; e-mail: [email protected] (Mridula Verma).

The basic framework we consider in this paper is as follows:

$$\min_x F(x) = f(x) + g(x), \tag{1}$$

where $f(x)$ is a smooth convex loss function whose gradient has Lipschitz constant $L$, and $g(x)$ is a non-smooth convex function. Various efficient optimization algorithms that solve problem (1) are available in the literature. Interior point methods (Turlach et al., 2005) compute results with high accuracy; however, each step requires the Hessian matrix, which demands a huge amount of memory. Alternatively, gradient-based methods may be used; these are first-order methods, since they require only gradients and subgradients. Other proposed methods include the projected subgradient method (Quattoni et al., 2009), the blockwise coordinate descent algorithm (Liu et al., 2009), forward-looking subgradients (Duchi and Singer, 2009), proximal gradient methods (Combettes and Pesquet, 2011), etc. A major drawback of the traditional proximal gradient algorithm is its slow convergence, which can be removed by replacing such methods with accelerated gradient methods (Beck and Teboulle, 2009; Nesterov, 2013; Chambolle and Dossal, 2015).

A general way of employing task relatedness is to use a proper regularization function, which reduces the overall formulation to a regularized risk minimization problem. In this work, we consider convex non-smooth regularizers, for example $\ell_1$-norm based functions. Lasso (Tibshirani, 1994) belongs to this class of problems: the loss function is the least-squares (convex smooth) function, minimized together with an $\ell_1$ norm. The main reason to employ the $\ell_1$ norm is to induce sparsity while learning the model, which is suitable for high-dimensional problems. In the multitask learning setting, various lasso extensions are available, such as the Group Lasso formulation (Yuan and Lin, 2005) via the $\ell_1$-$\ell_2$ block norm (Lounici et al., 2009) or the $\ell_1$-$\ell_\infty$ block norm (Negahban and Wainwright, 2009; Turlach et al., 2005), tree structure (Han and Zhang, 2015), graph structure (Argyriou et al., 2013), flexible sparsity structure (Kim and Xing, 2010; Chen et al., 2012), etc. In this paper, we follow the original lasso framework; however, our method can also be extended to more advanced lasso frameworks.

In this paper, we introduce a new accelerated proximal gradient method and apply it to the regularized multitask learning problem. We also prove the convergence and stability of the algorithm under specific conditions. The proposed method belongs to the class of extra-gradient based fixed point iterations with inertia. To the best of our knowledge, this is the first time that such a combination of the extra-gradient with an inertial step is proposed and applied to the regularized multitask regression problem. The superiority of the proposed algorithm is established using three multitask regression and two classification datasets.

The organization of this paper is as follows. Section 2 describes the multitask learning framework considered in this paper, the mathematical background of the problem, and a few related concepts. Section 3 introduces our proposed algorithm and discusses the theoretical proofs of convergence and stability under particular conditions. Section 4 presents experiments with several real datasets and a detailed analysis of the results. Finally, we conclude our work in Section 5.

2. Preliminaries

In this section, we first recall some basic facts about convex minimization problems and gradient-based methods. The convex optimization formulation of multitask learning is not new to the field of machine learning.

In the absence of the regularization function $g(\cdot)$ in (1), we obtain an overfit model that performs poorly on unseen samples. To obtain a generalized model, we add a regularization term with a controlling parameter $\rho > 0$:

$$\min_{x \in \mathbb{R}^d} F(x) = f(x) + \rho\, g(x). \tag{2}$$

The problem under consideration assumes $g(\cdot) = \|\cdot\|_1$, which provides a sparse solution.

The algorithm we propose lies in an infinite dimensional real Hilbert space $\mathcal{H}$, which generalizes the notion of Euclidean space: such spaces are complete abstract vector spaces whose distance function is induced by the inner product. Let $\mathcal{H}$ be a Hilbert space and $T : \mathcal{H} \to \mathcal{H}$ an operator. $T$ is called an $L$-Lipschitz operator if there exists $L \in [0, \infty)$ such that

$$\|Tx - Ty\| \le L \|x - y\|, \quad x, y \in \mathcal{H}. \tag{3}$$

An $L$-Lipschitz operator is called non-expansive if $L = 1$ and a contraction if $L < 1$. $T$ is monotone if it satisfies

$$(x, y), (x', y') \in G(T) \ \Rightarrow\ \langle x - x', y - y' \rangle \ge 0,$$

where $G(T) = \{(x, y) \in \mathcal{H} \times \mathcal{H} : x \in D(T),\ y \in Tx\}$ is its graph, which is also a monotone set in $\mathcal{H} \times \mathcal{H}$, and $D(T)$ is the domain of the operator $T$.

The traditional approach to problem (2) is the basic subgradient descent method proposed in (Bertsekas, 1999), which uses a black-box technique to solve (2) with subgradients of the non-smooth function. The subdifferential of the non-smooth function $g$ at $x$, denoted $\partial g(x)$, is defined by

$$\partial g(x) = \{\, y \mid g(z) \ge g(x) + y^{T}(z - x) \ \ \forall z \in \operatorname{dom} g \,\},$$

and any point $y \in \partial g(x)$ is called a subgradient of $g$ at $x$. The main drawbacks of this method are that it is not able to learn the sparsity structure (Bach et al., 2012) and that it is slow to converge. In proximal algorithms, in place of computing $\partial g$, we compute the resolvent of $\partial g$, namely $(I + \lambda\, \partial g)^{-1}$ with a parameter $\lambda > 0$, which is called the proximal operator (denoted $\operatorname{prox}_{\lambda g}$) w.r.t. $\lambda g$ and is computed as

$$\operatorname{prox}_{\lambda g}(v) = \arg\min_{x}\ g(x) + \frac{\|x - v\|^{2}}{2\lambda}, \quad v \in \mathbb{R}^{d}. \tag{4}$$

Closed-form expressions for the proximity operator are available in the literature for several regularizers; for the $\ell_1$ norm, the proximity operator is the soft-thresholding operator. In the field of fixed point theory, the generalized problem (1) can also be interpreted as a problem of finding zeros: $x^*$ minimizes $f + g$ if and only if $0 \in \nabla f(x^*) + \partial g(x^*)$, where $\nabla f$ and $\partial g$ denote the gradient of $f$ and the subdifferential of $g$, respectively. Then, for any $\lambda > 0$ and the identity matrix $I$, the optimality condition holds if

$$x^* = \operatorname{prox}_{\lambda g}\big(x^* - \lambda\, \nabla f(x^*)\big).$$

The Proximal Gradient Algorithm (PGA) solves the above fixed point problem with the iterative scheme

$$x_{n+1} = \operatorname{prox}_{\lambda g}(I - \lambda \nabla f)(x_n) = (I + \lambda\, \partial g)^{-1}(I - \lambda \nabla f)(x_n), \tag{5}$$

where $x_n$ denotes the $n$-th iterate of the algorithm. Generalizing this whole concept to infinite dimensional Hilbert spaces, our proposed model finds a zero of the sum of two monotone operators by solving $0 \in Az + Bz$ for $z \in \mathcal{H}$, where $A$ and $B$ are two monotone operators in $\mathcal{H}$; specifically, in equation (5), $A = \partial g$ and $B = \nabla f$ for $\mathbb{R}^d$.
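To make these pieces concrete, here is a minimal, hedged sketch (not from the paper, whose experiments use MATLAB): the soft-thresholding form of the prox in (4) for the $\ell_1$ norm, and one forward-backward step (5) applied to an illustrative least-squares $f$. The data and parameter values are stand-ins.

```python
import numpy as np

def prox_l1(v, lam):
    # Proximal operator (4) of lam*||.||_1: elementwise soft thresholding.
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def pga_step(x, grad_f, lam, rho):
    # One forward-backward step (5): forward gradient step on f,
    # then backward proximal step on rho*||.||_1.
    return prox_l1(x - lam * grad_f(x), lam * rho)

# Tiny lasso instance standing in for f: f(x) = 0.5*||Ax - b||^2.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)
grad_f = lambda x: A.T @ (A @ x - b)
lam = 1.0 / np.linalg.norm(A, 2) ** 2  # step 1/L, L = Lipschitz const of grad f
rho = 0.1

x = np.zeros(5)
for _ in range(500):
    x = pga_step(x, grad_f, lam, rho)
# At a minimizer, x is a fixed point of the forward-backward map.
print(np.allclose(x, pga_step(x, grad_f, lam, rho), atol=1e-6))
```

At a minimizer the iterate stops moving, which matches the fixed point characterization above.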

In addition, the PGA method (5) is called the forward-backward splitting algorithm in infinite dimensional Hilbert spaces, and the operator $(I + \lambda\, \partial g)^{-1}(I - \lambda \nabla f)$ is called the forward-backward operator. We assume that the parameter $\lambda$ is updated at each iteration, i.e. we use a parameter $\lambda_n$ and the corresponding operator $T_n$. Thus, the iterative procedure in (5) can be rewritten as

$$x_{n+1} = T_n(x_n). \tag{6}$$

The iterative scheme (6) is called the Picard fixed point iteration. In fixed point theory, this scheme plays an important role because of the Banach fixed point theorem, which states that for contraction-type mappings this scheme converges to a fixed point in a Banach space. Many iterative schemes have since been proposed for different mappings and spaces. In (Sahu, 2011), the author presented the normal s-iteration algorithm,

$$x_{n+1} = T_n\big((1 - \beta_n)x_n + \beta_n T_n x_n\big), \tag{7}$$

where $\beta_n \in (0, 1)$. The author also claimed that this fixed point iterative scheme converges faster than the Picard iterative scheme. In order to accelerate the convergence of PGA, the following accelerated gradient algorithm (AGA) was proposed in (Beck and Teboulle, 2009):

$$y_n = x_n + \alpha_n(x_n - x_{n-1}), \qquad x_{n+1} = \operatorname{prox}_{\lambda g}(I - \lambda \nabla f)(y_n) \quad \forall n \in \mathbb{N}, \tag{8}$$

where $\alpha_n = \frac{t_n - 1}{t_{n+1}}$ with $t_{n+1} = \frac{1 + \sqrt{1 + 4 t_n^2}}{2}$. Other definitions of $\alpha_n$ are also available, for example in (Chambolle and Dossal, 2015). In infinite dimensional Hilbert spaces, these methods are called inertial-based forward-backward splitting methods. Such methods can be applied to the multitask lasso framework, which is very popular in the multitask learning literature.

Let $T$ be the number of tasks, $m$ the total number of samples, and $d$ the number of dimensions shared by all tasks; $X$ denotes the collection of examples and $Y$ the collection of corresponding labels. Consider a multitask lasso framework with $T$ related tasks, where the training dataset of the $t$-th task is $D_t = (X_t, Y_t)$: $X_t$ is a collection of samples $x_{ti} \in \mathbb{R}^d$ and $Y_t$ is the collection of corresponding labels $y_{ti}$, for $i = 1, \dots, m_t$ and $t = 1, \dots, T$. The pair $(x_{ti}, y_{ti})$ represents the $i$-th input/output pair of the $t$-th task. The main objective of the multitask lasso problem is to jointly learn models for all the related subtasks, such that the models of the individual tasks can utilize the shared information. The minimization problem estimates the parameter $W$ from the training examples by solving

$$\min_W F(W) = f(W) + g(W) = \frac{1}{2} \sum_{t=1}^{T} \|X_t W_t - Y_t\|_F^2 + \rho\, \|W\|_1, \tag{9}$$

where $X_t \in \mathbb{R}^{m_t \times d}$, $W_t \in \mathbb{R}^d$ and $Y_t \in \mathbb{R}^{m_t}$ correspond to the $t$-th task, $\rho$ is the sparsity-controlling parameter and $\|\cdot\|_F$ is the Frobenius norm.

The concept of $T$-stability of an iteration process is defined as follows. Let $T : \mathcal{H} \to \mathcal{H}$ be a non-expansive operator and let $\{x_n\}_{n \in \mathbb{N}} \subset \mathcal{H}$ be the sequence generated by an iteration procedure involving $T$, defined by

$$x_{n+1} = h(T, x_n) \quad \text{for } n \in \mathbb{N}, \tag{10}$$

where $h$ is some function defining the iteration scheme. Suppose $\{x_n\}_{n \in \mathbb{N}}$ converges to a fixed point $p$ of $T$. Let $\{y_n\}_{n \in \mathbb{N}} \subset \mathcal{H}$ and set $\epsilon_n := d(y_{n+1}, h(T, y_n))$ for $n \in \mathbb{N}$, where $d(\cdot,\cdot)$ is a distance function ($\|\cdot\|$ in a real normed linear space). Then the iteration process (10) is said to be $T$-stable, or stable with respect to $T$, if $\lim_{n \to \infty} \epsilon_n = 0$ implies $\lim_{n \to \infty} y_n = p$.

We end this section with the following lemma, which we will use in our theoretical proofs.

Lemma 1. (Berinde, 2002) Let $\delta$ be a real number satisfying $0 \le \delta < 1$ and $\{\epsilon_n\}_{n=0}^{\infty}$ be a sequence of positive numbers such that $\lim_{n \to \infty} \epsilon_n = 0$. Then, for any sequence of positive numbers $\{u_n\}_{n=0}^{\infty}$ satisfying $u_{n+1} \le \delta u_n + \epsilon_n$ for $n = 0, 1, \dots$, we have $\lim_{n \to \infty} u_n = 0$.

3. nAGA Algorithm for Multitask Learning

In this section, we give a detailed explanation of the proposed new accelerated proximal gradient algorithm (nAGA), along with the methodological assumptions, for solving the multitask lasso problem (9), and present its stability analysis. The following assumptions are made for the proposed algorithm:

1. The loss function $f(W)$ is convex and continuously differentiable, with a Lipschitz continuous gradient with constant $L_f$: $\|\nabla f(A) - \nabla f(B)\| \le L_f \|A - B\|$ for all $A, B$.
2. The regularization function $g(W)$ is a continuous convex function, which may be non-smooth.
3. Problem (9) is solvable, i.e., $W^* = \arg\min F(W) \neq \emptyset$.
4. The sequence $\{\alpha_n\} \subset (0, 1)$, for $n = 1, 2, \dots, N$, is nondecreasing.

For any $W_0$ and $W_1 \in \mathcal{H}$, the nAGA iterative algorithm is defined as follows:

$$\begin{aligned} P_n &= W_n + \alpha_n (W_n - W_{n-1}) \\ Q_n &= \operatorname{prox}_{\rho c_n \|\cdot\|_1}\!\big(P_n - c_n (X^T X P_n - X^T Y)\big) \\ R_n &= (1 - \beta_n) P_n + \beta_n Q_n \\ W_{n+1} &= \operatorname{prox}_{\rho c_n \|\cdot\|_1}\!\big(R_n - c_n (X^T X R_n - X^T Y)\big), \end{aligned} \tag{11}$$

where $\alpha_n, \beta_n \in (0, 1)$ and $\operatorname{prox}_{\rho c_n \|\cdot\|_1}$ is computed as discussed in the previous section. The term $(W_n - W_{n-1})$ introduces an inertial step that, with proper parameter settings, produces acceleration. It should be noted that $\alpha_n$ is a generalized term that can have different definitions, such as those in (Beck and Teboulle, 2009; Chambolle and Dossal, 2015). The behavior of the term $\alpha_n$ is shown in Fig. 1. The pseudo-code of the algorithm is given in Algorithm 1. We consider that the value of the Lipschitz constant $L$ is not known in advance, and thus we adapt a backtracking line search in each iteration. Also, to establish convergence, we ensure that the value of $c_n$ belongs to the set $(0, 2/L]$.
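The update (11) can be sketched in code. This is an illustrative single-block version (one least-squares term standing in for the sum in (9)), with a fixed step $c = 1/L$ in place of the backtracking rule of Algorithm 1 and the FISTA-style $\alpha_n$ from (8); all data and parameter choices below are stand-ins, not the paper's implementation.

```python
import numpy as np

def prox_l1(v, lam):
    # Soft thresholding: proximal operator of lam*||.||_1.
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def naga(X, Y, rho, beta=0.5, n_iter=300):
    """One-block sketch of scheme (11): inertial point P_n, extra prox-gradient
    step Q_n, averaging R_n, and a second prox-gradient step for W_{n+1}.
    A fixed step c = 1/L replaces the paper's backtracking rule (a
    simplifying assumption)."""
    L = np.linalg.norm(X, 2) ** 2          # Lipschitz constant of W -> X^T(XW - Y)
    c = 1.0 / L
    grad = lambda W: X.T @ (X @ W - Y)
    W_prev = W = np.zeros(X.shape[1])
    t = 1.0
    for _ in range(n_iter):
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        alpha = (t - 1.0) / t_next         # FISTA-style inertial weight alpha_n
        P = W + alpha * (W - W_prev)
        Q = prox_l1(P - c * grad(P), rho * c)
        R = (1.0 - beta) * P + beta * Q
        W_prev, W = W, prox_l1(R - c * grad(R), rho * c)
        t = t_next
    return W

# Illustrative single-task data standing in for one block of problem (9).
rng = np.random.default_rng(0)
X = rng.standard_normal((40, 8))
Y = rng.standard_normal(40)
W = naga(X, Y, rho=0.5)
F = lambda W: 0.5 * np.linalg.norm(X @ W - Y) ** 2 + 0.5 * np.linalg.norm(W, 1)
print(F(W) < F(np.zeros(8)))
```

The inertial step and the extra prox-gradient pass are exactly the two ingredients whose combination the paper analyzes.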

It is important to note that the computational cost of each iteration is slightly higher than that of the traditional accelerated method; however, the reduction in the number of iterations due to the proposed arrangement compensates for this overhead.

One of the main characteristics of an optimization algorithm is its behavior when the parameters or the sequence values are slightly perturbed. A stable algorithm is not affected by such perturbations and still converges to the fixed point. In the next subsection, we analyze the convergence and stability of the nAGA algorithm on a class of contraction operators.

Algorithm 1 nAGA for MTL
1: procedure nAGA($\beta_1$, $\alpha_1$, $\rho$, tol)
2:   $W_0 = W_1 \in \mathcal{H}$, $c_1 = 1$
3:   repeat
4:     find $c_n$ using the backtracking stepsize rule
5:     $P_n = W_n + \alpha_n(W_n - W_{n-1})$
6:     $Q_n = \operatorname{prox}_{\rho c_n \|\cdot\|_1}(P_n - c_n(X^T X P_n - X^T Y))$
7:     $R_n = (1 - \beta_n)P_n + \beta_n Q_n$
8:     $W_{n+1} = \operatorname{prox}_{\rho c_n \|\cdot\|_1}(R_n - c_n(X^T X R_n - X^T Y))$
9:   until $F(W_n) - F(W_{n-1}) \le$ tol
10:  return $W_{n+1}$
11: end procedure

Fig. 1. Illustration of the behavior of the sequence $\alpha_n$ vs. iterations.

3.1. Convergence and Stability Analysis of nAGA

As described in the previous section, contraction operators form a subclass of non-expansive operators. Let $\{x_n\}$ be a sequence in $\mathcal{H}$, let $T : \mathcal{H} \to \mathcal{H}$ be a contraction mapping, and let $p$ be a fixed point of $T$. In (Bosede and Rhoades, 2010), the authors proved that every contraction mapping that has a fixed point satisfies, for some $0 \le \delta < 1$,

$$\|p - Tx\| \le \delta \|p - x\| \quad \text{for } x \in \mathcal{H}. \tag{12}$$

Let $T_n$ be the forward-backward operator with respect to $c_n$, as described in the previous section. In order to prove the theorems, we rewrite the iterative scheme (11) as

$$y_n = x_n + \alpha_n(x_n - x_{n-1}), \qquad x_{n+1} = T_n\big[(1 - \beta_n)y_n + \beta_n T_n y_n\big]. \tag{13}$$

It should be noted from (Argyriou et al., 2011) that the algorithm proposed in the previous section is stated for non-expansive operators; this non-expansivity of the operator $T_n$ holds if $0 \le c_n \le 2/L$. However, to prove convergence and stability, we assume the contraction property of the operators, which holds for $0 < c_n < 2/L$. In addition, in order to prove convergence and stability, we assume the condition

$$\|x_n - x_{n-1}\| / \alpha_n \to 0 \quad \text{as } n \to \infty. \tag{14}$$

Theorem 1. Let $\mathcal{H}$ be a Hilbert space and $T : \mathcal{H} \to \mathcal{H}$ be a contraction mapping satisfying condition (12) for $0 \le \delta < 1$ with fixed point $p$. For initial points $x_0$ and $x_1$, let $\{x_n\}_{n=0}^{\infty}$ be the sequence generated by (11), where $\alpha_n, \beta_n \in (0, 1]$, and let the sequence satisfy condition (14). Then,

1. $\{x_n\}_{n=0}^{\infty}$ converges strongly to $p$;
2. the iterative scheme (11) is $T$-stable.

Proof. 1. From (12) and (13), we have

$$\begin{aligned} \|x_{n+1} - p\| &= \big\|T_n[(1 - \beta_n)y_n + \beta_n T_n y_n] - p\big\| \\ &\le \delta \big\|(1 - \beta_n)y_n + \beta_n T_n y_n - p\big\| \\ &\le \delta \big((1 - \beta_n)\|y_n - p\| + \beta_n \|T_n y_n - p\|\big) \\ &\le \delta (1 - \beta_n(1 - \delta)) \|y_n - p\| \\ &\le \delta \Big[(1 - \beta_n(1 - \delta))\|x_n - p\| + \alpha_n (1 - \beta_n(1 - \delta)) \|x_n - x_{n-1}\|\Big] \\ &\le \delta \Big[(1 - \beta_n(1 - \delta))\|x_n - p\| + (1 - \beta_n(1 - \delta)) \frac{\|x_n - x_{n-1}\|}{\alpha_n}\Big]. \end{aligned}$$

Since $\{\alpha_n\}, \{\beta_n\} \subset (0, 1]$ and $0 \le \delta < 1$, condition (14) and Lemma 1 give $\lim_{n \to \infty} \|x_{n+1} - p\| = 0$. This ends the proof of the first claim.

2. Let $\{p_n\}_{n=0}^{\infty}$ be a sequence in $\mathcal{H}$. Following (10) and (11), define $\epsilon_n = \|p_{n+1} - T_n[(1 - \beta_n)q_n + \beta_n T_n q_n]\|$, where $q_n = p_n + \alpha_n(p_n - p_{n-1})$. First, let $\lim_{n \to \infty} \epsilon_n = 0$; we shall prove that $\lim_{n \to \infty} p_n = p$:

$$\begin{aligned} \|p_{n+1} - p\| &= \|p_{n+1} - T_n[(1 - \beta_n)q_n + \beta_n T_n q_n] + T_n[(1 - \beta_n)q_n + \beta_n T_n q_n] - p\| \\ &\le \epsilon_n + \delta \|(1 - \beta_n)q_n + \beta_n T_n q_n - p\| \\ &\le \epsilon_n + \delta (1 - (1 - \delta)\beta_n) \|q_n - p\| \\ &\le \epsilon_n + \delta \Big[(1 - (1 - \delta)\beta_n)\|p_n - p\| + \alpha_n (1 - (1 - \delta)\beta_n)\|p_n - p_{n-1}\|\Big] \\ &\le \epsilon_n + \delta \Big[(1 - (1 - \delta)\beta_n)\|p_n - p\| + (1 - (1 - \delta)\beta_n)\frac{\|p_n - p_{n-1}\|}{\alpha_n}\Big]. \end{aligned}$$

Since $\{\alpha_n\}, \{\beta_n\} \subset (0, 1]$ and $0 \le \delta < 1$, condition (14) and Lemma 1 give $\lim_{n \to \infty} p_n = p$.

Conversely, let $\lim_{n \to \infty} p_n = p$; we shall show that $\lim_{n \to \infty} \epsilon_n = 0$. We have

$$\begin{aligned} \epsilon_n &= \|p_{n+1} - T_n[(1 - \beta_n)q_n + \beta_n T_n q_n]\| \\ &\le \|p_{n+1} - p\| + \delta \|(1 - \beta_n)q_n + \beta_n T_n q_n - p\| \\ &\le \|p_{n+1} - p\| + \delta (1 - (1 - \delta)\beta_n)\|q_n - p\| \\ &\le \|p_{n+1} - p\| + \delta \Big[(1 - (1 - \delta)\beta_n)\|p_n - p\| + (1 - (1 - \delta)\beta_n)\frac{\|p_n - p_{n-1}\|}{\alpha_n}\Big]. \end{aligned}$$

Since $\{\alpha_n\}, \{\beta_n\} \subset (0, 1]$ and $0 \le \delta < 1$, condition (14) and our assumption give $\lim_{n \to \infty} \epsilon_n = 0$. This ends the proof.

4. Experimental Results and Detailed Analysis


In this section, we give a detailed description of the experimental setup and the result analysis. All tests were performed on an Intel Core i7 processor with 10 GB RAM, under the MATLAB computing environment. We used three publicly available benchmark multitask regression datasets, namely School¹ (m = 15362, d = 28, t = 139), Sarcos² (m = 48933, d = 21, t = 51) and Parkinsons³ (m = 5875, d = 19, t = 84), and, for classification, the publicly available handwritten digit recognition dataset USPS² (m = 9298, d = 256, t = 10) and the object recognition dataset Coil20⁴ (m = 1440, d = 1024, t = 20). For the classification problem, each one-vs-rest binary classification problem is considered as a task. We compared our algorithm (11) with the classic Proximal Gradient Descent method without acceleration (5), the Proximal Gradient method with the normal s-iteration scheme (7) proposed in (Sahu, 2011) (sometimes referred to as the Picard-Mann hybrid scheme in the fixed point theory literature, and denoted nPGA below), and the Accelerated Gradient Descent method (AGA) (8) from (Chambolle and Dossal, 2015). It should be noted that this is the first time these fixed point iterative schemes are compared on the problems of multitask regression and classification.

To demonstrate the convergence of our algorithm, we performed experiments with a 70%-30% training/testing split for all the datasets. As a pre-processing step, z-score normalization is performed on Xt, and a bias column is added to the data of each task to learn the bias. As performance measures, the standard root mean square error (rMSE) is used for the regression problem, and precision, recall and classification error are used for the classification problem. The sparsity-controlling parameter is set as ρ = θ × ρmax, where ρmax = ‖XᵀY‖∞ and θ is chosen from the range 0.001 to 1 with a precision of 0.001, selecting the value giving the lowest prediction error. For the stopping criterion, the tolerance tol (the difference between two consecutive function values) is set to 10^-5, which also signals convergence. The maximum number of iterations is set to 10^4. All vectors are initialized with zeros, the values of cn and L are initialized with 1, and βn is set to 1/(n+1).

Our first result is the comparison of the first-order proximal gradient algorithms based on their convergence speed. Results are

shown in figure 2 as log-log plots. It can be observed that the proposed algorithm consistently gives better convergence results than the previous state-of-the-art algorithms on all the datasets. With the Sarcos and Parkinsons datasets, the accelerated algorithms form ripples; the reason behind these ripples is that the objective function is not monotonically decreasing for the accelerated methods. Such cases can be handled using the concepts discussed in (O'Donoghue and Candès, 2015); we will study this behavior in future work. It is important to note that on the School and Parkinsons datasets the nPGA algorithm converges at a better rate than AGA. This result is interesting to investigate, since it demonstrates the direct advantage of the extra-gradient technique over the traditional accelerated gradient method. In figure 3, we show the objective function values at each iteration for the three datasets. The rapid reduction of the objective values clearly demonstrates the efficiency of the proposed algorithm. It can be observed from the figure that the proximal gradient method with the extra-gradient fixed point technique reduces the function value better than the accelerated gradient technique; on all three datasets, the rate of reduction of the function values is best for nAGA. The accuracy results are shown in figure 4. We consider the standard root mean square error (rMSE) to demonstrate the accuracy-related performance of the proposed algorithm. It can be concluded that nAGA gives a better error reduction rate, i.e. reduction in rMSE per iteration, on all the datasets. The detailed results in table 1 show that the final rMSE values achieved by nAGA are better than those obtained by AGA, and that the number of iterations needed to reach the final value is significantly smaller than that of AGA.
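As a hedged illustration of the parameter grid described in the setup above, ρ = θ · ρmax with ρmax = ‖XᵀY‖∞ and θ scanned from 0.001 to 1 in steps of 0.001; the data below are random stand-ins, not the benchmark datasets.

```python
import numpy as np

# Illustrative data standing in for a task's design matrix and labels.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
Y = rng.standard_normal(100)

rho_max = np.max(np.abs(X.T @ Y))        # ||X^T Y||_inf
thetas = np.linspace(0.001, 1.0, 1000)   # grid with precision 0.001
rhos = thetas * rho_max
# For the lasso objective 0.5*||XW - Y||^2 + rho*||W||_1, any rho >= rho_max
# yields the all-zero solution, so the useful part of the grid lies strictly
# below rho_max.
print(rho_max > 0)
```

This is why ρmax is a natural upper end of the search range: above it, the ℓ1 penalty dominates and the model degenerates to zero.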
Table 1 shows the detailed regression results for the three datasets in terms of the number of iterations to reach convergence (i.e., until the difference between successive objective function values reaches the tolerance 10^-5), the rMSE value, the minimum objective value obtained and the CPU time. Best values are highlighted in bold. It is evident from the table that nAGA takes the least number of iterations on all the datasets. For the Sarcos dataset, nAGA outperforms all the other algorithms. With the School dataset, the best rMSE is obtained with the nPGA algorithm; however, the number of iterations needed to achieve this accuracy is significantly higher than that of nAGA. With the Parkinsons dataset, nPGA achieves the least function value, but again with a significantly higher number of iterations than nAGA.

To demonstrate the performance on the classification task, we present the results in table 2 in terms of precision, recall, accuracy and training time. The given results are averages over ten experiments. All the experimental settings for the classification task are the same as before, except that the value of θ is now chosen from the range 0.001 to 1 with a precision of 0.001 for the highest value of classification accuracy. It is evident from the table that the nPGA algorithm performs well for the classification task in terms of the standard measures on both datasets; however, this result is obtained after a very large number of iterations and a high computational time. A similar performance is obtained with nAGA in a significantly smaller number of iterations and computational time.

¹ Available at http://ttic.uchicago.edu/~argyriou/code/mtl_feat/school_splits.tar
² Available at http://www.gaussianprocess.org/gpml/data/
³ Available at http://archive.ics.uci.edu/ml/datasets/Parkinsons+Telemonitoring
⁴ Available at http://featureselection.asu.edu/datasets.php

Fig. 2. The performance of nAGA measured by F(Wn) − F(W∗) on the (a) Sarcos, (b) School and (c) Parkinsons datasets; F(Wn) is the function value achieved at the n-th iteration and F(W∗) is the optimal function value. The graphs are log-log plots.

Fig. 3. The performance of nAGA measured by the reduction in function value at each iteration on the (a) Sarcos, (b) School and (c) Parkinsons datasets.

Fig. 4. The performance of nAGA measured by the rMSE reduction rate on the (a) Sarcos, (b) School and (c) Parkinsons datasets.

5. Conclusion

In this paper, we proposed a novel accelerated proximal gradient method nAGA for the regularized multitask learning task with non-smooth regularizers. We also proved that the proposed algorithm is T −stable with respect to the contraction mapping T , which in this work corresponds to a class of forward-backward operator (the operator utilized in proximal gradient techniques). Experiments are discussed with several publicly available multitask regression and classification benchmark datasets. Empirical results directly imply that algorithm

nAGA converges even faster than the traditional accelerated proximal gradient method under multitask learning framework. It also improves the prediction as well as classification accuracy for all the datasets. In future, we aim to perform the theoretical convergence rate analysis for the proposed algorithm. References Ando, R.K., Zhang, T., 2005. A framework for learning predictive structures from multiple tasks and unlabeled data. J. Mach. Learn. Res. 6, 1817–1853. Argyriou, A., Cl´emenc¸on, S., Zhang, R., 2013. Learning the Graph of Relations Among Multiple Tasks. Research Report. Argyriou, A., Evgeniou, T., Pontil, M., 2008. Convex multi-task feature learning. Mach. Learn. 73, 243–272.

ACCEPTED MANUSCRIPT 8 Table 1. Regression Results. Best values are highlighted with bold letters.

School

Parkinson

PGA 2268 1.48297E6 1.20762 11.9376 320 657477.98 9.26612 0.9745 912 27311.87 1.19006 3.8187

Table 2. Classification Results. Best values are highlighted with bold letters.

nPGA 10E4 0.9238 0.8915 0.9763 242.8 10E4 0.9901 0.9992 0.9998 388.27

AGA 1820 0.9327 0.8619 0.9738 23.2 1025 0.5797 0.9445 0.9647 36.11

nAGA 683 0.9476 0.8750 0.9764 22.1 989 0.9899 0.9964 0.9995 35.60

M

Coil20

PGA 10E4 0.7289 0.8615 0.9459 179.6 10E4 0.9839 0.9992 0.9986 293.12

CE

PT

ED

Argyriou, A., Micchelli, C.A., Pontil, M., Shen, L., Xu, Y., 2011. Efficient first order methods for linear composite regularizers. CoRR abs/1104.1436.
Bach, F., Jenatton, R., Mairal, J., Obozinski, G., 2012. Structured sparsity through convex optimization. Statist. Sci. 27, 450–468. doi:10.1214/12-STS394.
Beck, A., Teboulle, M., 2009. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences 2, 183–202.
Berinde, V., 2002. On the stability of some fixed point procedures. Bul. Stiint. Univ. Baia Mare, Ser. B, Matematica-Informatica 18, 7–14.
Bertsekas, D., 1999. Nonlinear Programming. Athena Scientific.
Turlach, B.A., Venables, W.N., Wright, S.J., 2005. Simultaneous variable selection. Technometrics.
Bickel, S., Bogojeska, J., Lengauer, T., Scheffer, T., 2008. Multi-task learning for HIV therapy screening, in: Proceedings of the 25th International Conference on Machine Learning, ACM, New York, NY, USA, pp. 56–63.
Bosede, A.O., Rhoades, B., 2010. Stability of Picard and Mann iteration for a general class of functions. Journal of Advanced Mathematical Studies 3, 23–26.
Chambolle, A., Dossal, C., 2015. On the convergence of the iterates of the fast iterative shrinkage/thresholding algorithm. Journal of Optimization Theory and Applications 166, 968–982.
Chapelle, O., Shivaswamy, P., Vadrevu, S., Weinberger, K., Zhang, Y., Tseng, B., 2010. Multi-task learning for boosting with application to web search ranking, in: Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, pp. 1189–1198.
Chen, X., Lin, Q., Kim, S., Carbonell, J.G., Xing, E.P., 2012. Smoothing proximal gradient method for general structured sparse learning. CoRR abs/1202.3708.
Collobert, R., Weston, J., 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning, in: Proceedings of the 25th International Conference on Machine Learning, ACM, New York, NY, USA, pp. 160–167.



Combettes, P.L., Pesquet, J.C., 2011. Proximal splitting methods in signal processing. Springer New York, New York, NY, pp. 185–212. doi:10.1007/978-1-4419-9569-8_10.
Daumé III, H., Kumar, A., Saha, A., 2010. Frustratingly easy semi-supervised domain adaptation, in: Proceedings of the 2010 Workshop on Domain Adaptation for Natural Language Processing, Association for Computational Linguistics, Stroudsburg, PA, USA, pp. 53–59.
Duchi, J., Singer, Y., 2009. Efficient online and batch learning using forward backward splitting. J. Mach. Learn. Res. 10, 2899–2934.
Han, L., Zhang, Y., 2015. Learning tree structure in multi-task learning, in: Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, pp. 397–406.
Kim, S., Xing, E.P., 2010. Tree-guided group lasso for multi-task regression with structured sparsity, in: Proceedings of the 27th International Conference on Machine Learning (ICML-10), Haifa, Israel, pp. 543–550.
Li, B., Yang, Q., Xue, X., 2009. Transfer learning for collaborative filtering via a rating-matrix generative model, in: Proceedings of the 26th Annual International Conference on Machine Learning, ACM, New York, NY, USA, pp. 617–624.
Liu, H., Palatucci, M., Zhang, J., 2009. Blockwise coordinate descent procedures for the multi-task lasso, with applications to neural semantic basis discovery, in: Danyluk, A.P., Bottou, L., Littman, M.L. (Eds.), ICML, ACM, pp. 649–656.
Lounici, K., Pontil, M., Tsybakov, A.B., van de Geer, S., 2009. Taking advantage of sparsity in multi-task learning, in: Proceedings of the 22nd Conference on Learning Theory, pp. 73–82.
Negahban, S., Wainwright, M.J., 2009. Phase transitions for high-dimensional joint support recovery, in: Koller, D., Schuurmans, D., Bengio, Y., Bottou, L. (Eds.), Advances in Neural Information Processing Systems 21. Curran Associates, Inc., pp. 1161–1168.
Nesterov, Y., 2013. Gradient methods for minimizing composite functions. Mathematical Programming 140, 125–161. doi:10.1007/s10107-012-0629-5.
O'Donoghue, B., Candès, E., 2015. Adaptive restart for accelerated gradient schemes. Foundations of Computational Mathematics 15, 715–732. doi:10.1007/s10208-013-9150-3.
Quattoni, A., Carreras, X., Collins, M., Darrell, T., 2009. An efficient projection for l1,∞ regularization, in: Danyluk, A.P., Bottou, L., Littman, M.L. (Eds.), ICML, ACM, pp. 857–864.
Sahu, D.R., 2011. Applications of the S-iteration process to constrained minimization problems and split feasibility problems. Fixed Point Theory 12, 187–204.
Tibshirani, R., 1994. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B 58, 267–288.
Yuan, M., Lin, Y., 2005. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 68, 49–67.
Zhou, J., Yuan, L., Liu, J., Ye, J., 2011. A multi-task learning formulation for predicting disease progression, in: Proceedings of the 17th ACM SIGKDD





International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA. pp. 814–822.