Efficient Lasso training from a geometrical perspective


Quan Zhou, Shiji Song (corresponding author), Gao Huang, Cheng Wu
Department of Automation, Tsinghua University, Beijing 100084, China
Neurocomputing (2015), http://dx.doi.org/10.1016/j.neucom.2015.05.103

Article history: Received 14 February 2015; received in revised form 23 May 2015; accepted 29 May 2015. Communicated by K. Li.

Abstract

The Lasso (L1-penalized regression) has drawn great interest in machine learning and statistics due to its robustness and high accuracy. A variety of methods have been proposed for solving the Lasso, but for large scale problems the presence of the L1 norm constraint significantly impedes efficiency. Inspired by recent theoretical and practical contributions on the close relation between the Lasso and SVMs, we reformulate the Lasso as the problem of finding the nearest point in a polytope to the origin, which circumvents the L1 norm constraint. This problem can be solved efficiently from a geometric perspective using Wolfe's method. Compared with least angle regression (LARS), a conventional method for solving the Lasso, the proposed algorithm is advantageous in both efficiency and numerical stability. Experimental results show that the proposed approach is competitive with other state-of-the-art Lasso solvers on large scale problems.

Keywords: Lasso; Reduction; Feature selection; Large scale problem

1. Introduction

Feature selection has been widely used in many machine learning applications where a subset of features must be selected from thousands or even millions of candidates. An effective and efficient feature selection method can greatly improve the generalization of learning algorithms and reduce the testing time cost [16,15,7]. Among existing feature selection algorithms, the Lasso [11,12,19] has proven to be one of the most practical because of its robustness and high accuracy. Owing to its wide applications, many algorithms have been proposed for solving the Lasso in recent years. LARS [4], proposed by Efron et al., is a well-known algorithm for solving the Lasso; it proceeds in a direction equiangular among the selected features and computes the regularization path efficiently. The L1_LS algorithm [6] first transforms the Lasso to its dual form and then uses a log-barrier interior point method for optimization. The optimization is based on the Preconditioned Conjugate Gradient (PCG) method for solving the Newton steps, which is suitable for large scale sparse compressed sensing problems. Liu et al. [9] proposed the Sparse Learning with Efficient Projections (SLEP) algorithm, which solves the Lasso by computing the Euclidean projection onto an L1 ball; this projection can be computed efficiently via bisection. The coordinate gradient descent algorithm has also become a popular strategy for solving the Lasso [17]. It optimizes over one variable at a time. Although it may require many coordinate updates to converge, coordinate gradient descent can solve large scale Lasso problems very efficiently, since each update is trivial to compute. Due to its inherently sequential nature, however, coordinate descent is hard to parallelize. The Shotgun algorithm [3] is among the first to parallelize coordinate descent for the Lasso; this implementation can handle large scale sparse data sets that other solvers cannot cope with due to memory constraints.

In this paper, we provide a new perspective on solving the Lasso. Our work is inspired by recent theoretical and practical contributions that reveal the close relation between the Lasso and SVMs [5,18]. Combined with a geometric interpretation [14], we extend this line of work and derive a practical algorithm. Experimental results show that the proposed algorithm has better numerical stability than LARS and is competitive with other Lasso solvers on large scale problems. Although there are many well-known algorithms for solving the Lasso, we believe our algorithm can serve as an alternative and enhance the geometrical understanding of the Lasso.

This paper is organized as follows. Section 2 gives a brief overview of the Lasso. In Section 3, we reformulate the Lasso and solve it using Wolfe's method. Section 4 presents the experimental results, and we conclude in Section 5.

2. Notation and background


Throughout this paper we typeset vectors in bold (a), scalars in regular type (C or t), and matrices in capital bold (A). Specific entries of vectors or matrices are scalars and follow the corresponding convention, i.e. the $i$th dimension of vector $a$ is $a_i$. Depending on the context, $a_i$ may also refer to the $i$th column of matrix $A$, and $a^{(i)}$ refers to the transpose of its $i$th row. $\mathbf{1}_m$ is a column vector of all ones with length $m$.

In the remainder of this section we briefly review the Lasso. In the regression scenario we are given a data set $\{(a^{(i)}, b_i)\}_{i=1}^n$, where each $a^{(i)} \in \mathbb{R}^m$ and the labels are real valued, i.e. $b_i \in \mathbb{R}$. Let $b = (b_1, \dots, b_n)^\top$ be the response vector and $A \in \mathbb{R}^{n \times m}$ be the design matrix whose (transposed) $i$th row is $a^{(i)}$. We assume throughout that the response vector is centered and all features are normalized as follows:

$$\sum_{i=1}^n b_i = 0, \qquad \sum_{i=1}^n a_{ij} = 0 \ \text{ and } \ \sum_{i=1}^n a_{ij}^2 = 1 \quad \text{for } j = 1, \dots, m, \tag{1}$$

where $a_{ij}$ denotes the $j$th feature of $a^{(i)}$. The Lasso [11] learns a (sparse) linear model to predict $b_i$ from $a^{(i)}$ by minimizing the squared loss subject to an L1-norm constraint,

$$\min_{w \in \mathbb{R}^m} \ \|Aw - b\|_2^2 \quad \text{s.t.} \quad |w|_1 \le t, \tag{2}$$

where $w = [w_1, \dots, w_m]^\top \in \mathbb{R}^m$ denotes the weight vector and $t > 0$ is the L1-norm budget, which is usually specified by the user. Equivalently, the solution to (2) also minimizes the unconstrained version of the problem

$$\min_{w \in \mathbb{R}^m} \ \|Aw - b\|_2^2 + \lambda |w|_1, \tag{3}$$

where $\lambda \ge 0$. There is a one-to-one correspondence between (2) and (3). In the rest of the paper we focus on the constrained problem (2).

3. Reformulation as a nearest point problem

In this section, we make a series of equivalent transformations and reduce (2) to the problem of finding the nearest point in a polytope to the origin, or nearest point problem (NPP) for short.

3.1. Reformulation of the Lasso

We start with the Lasso formulation in (2). First, we divide the objective function and the constraint by $t$ and substitute a rescaled weight vector, $w := (1/t)w$. This step allows us to absorb the constant $t$ into the objective function, and we can rewrite (2) as

$$\min_{w \in \mathbb{R}^m} \ \left\| \tfrac{1}{t} b - Aw \right\|_2^2 \quad \text{s.t.} \quad |w|_1 \le 1. \tag{4}$$

To simplify the L1 constraint, we follow [10] and split $w$ into two sets of non-negative variables, representing positive components $w^+ \ge 0$ and negative components $w^- \ge 0$, i.e. $w = w^+ - w^-$. We then stack $w^+$ and $w^-$ together to form a new weight vector $\hat{w} = [w^+; w^-] \in \mathbb{R}^{2m}_{\ge 0}$. Here $\mathbb{R}^{2m}_{\ge 0}$ denotes the set of all vectors in $\mathbb{R}^{2m}$ with nonnegative entries. We can rewrite (4) as

$$\min_{\hat{w}_i \in \mathbb{R}_+} \ \left\| \tfrac{1}{t} b - [A, -A]\hat{w} \right\|_2^2 \quad \text{s.t.} \quad \sum_{i=1}^{2m} \hat{w}_i \le 1. \tag{5}$$

Except for the case of extremely large $t \gg 0$, the L1-norm constraint in (2) is always tight [2], i.e. $|\hat{w}|_1 = 1$. (If $t$ is extremely large, (2) is equivalent to the ordinary least squares problem and the constraint can be removed.) We can incorporate this equality constraint into (5) and obtain

$$\min_{\hat{w}_i \in \mathbb{R}_+} \ \left\| \tfrac{1}{t} b - [A, -A]\hat{w} \right\|_2^2 \quad \text{s.t.} \quad \sum_{i=1}^{2m} \hat{w}_i = 1. \tag{6}$$

Since $\mathbf{1}_{2m}^\top \hat{w} = 1$, we can expand $b = b\,\mathbf{1}_{2m}^\top \hat{w}$, and (6) becomes

$$\min_{\hat{w}_i \in \mathbb{R}_+} \ \left\| \left[ A - \tfrac{1}{t} b \mathbf{1}_m^\top,\ -A - \tfrac{1}{t} b \mathbf{1}_m^\top \right] \hat{w} \right\|_2^2 \quad \text{s.t.} \quad \sum_{i=1}^{2m} \hat{w}_i = 1. \tag{7}$$

We construct a matrix $P = [\hat{A}_1, -\hat{A}_2]$, where $\hat{A}_1 = A - \tfrac{1}{t} b \mathbf{1}_m^\top$ and $\hat{A}_2 = A + \tfrac{1}{t} b \mathbf{1}_m^\top$. Substituting $P$ into (7) gives

$$\min_{\hat{w}_i \in \mathbb{R}_+} \ \| P\hat{w} \|_2^2 \quad \text{s.t.} \quad \sum_{i=1}^{2m} \hat{w}_i = 1, \tag{8}$$

and we obtain a polytope $\{x \mid x = \sum_{i=1}^{2m} p_i \hat{w}_i = P\hat{w},\ \sum_{i=1}^{2m} \hat{w}_i = 1,\ \hat{w}_i \ge 0\}$, whose vertices consist of the $2m$ points $\{p_1, p_2, \dots, p_{2m}\}$ in $\mathbb{R}^n$.

For a better understanding of the reduction process, we visualize it in Fig. 1. First we have $m$ points $\{a_1, \dots, a_m\}$; flipping them through the origin $o$ gives $m$ new points $\{-a_1, \dots, -a_m\}$. Problem (6) is to find the nearest point in the convex hull of $\{a_1, \dots, a_m, -a_1, \dots, -a_m\}$ to the point $(1/t)b$. We then shift the origin $o$ to the point $(1/t)b$, i.e. each point in $\{a_1, \dots, a_m, -a_1, \dots, -a_m\}$ is translated by $-(1/t)b$, which gives problem (7). Finally we use the new notation $\{p_1, \dots, p_m, p_{m+1}, \dots, p_{2m}\}$ for $\{a_1 - (1/t)b, \dots, a_m - (1/t)b, -a_1 - (1/t)b, \dots, -a_m - (1/t)b\}$, and problem (8) is equivalent to finding the nearest point in the convex hull of $\{p_1, \dots, p_{2m}\}$ to the origin. We summarize the reduction steps in Algorithm 1. Since the proposed method has an intuitive geometrical interpretation, we refer to it as GeoLasso.

Problem (8) is, of course, a quadratic program, for which there are several efficient algorithms, such as interior point methods [2]. Our interest, however, is to further explore the geometrical properties of this problem and to derive a specialized solver based on Wolfe's method.

Algorithm 1. Sketch of GeoLasso.
Input: design matrix A, response vector b, L1 budget t.
Output: Lasso solution, i.e. the output weights w.
1: $P = [A - \tfrac{1}{t} b \mathbf{1}_m^\top,\ -A - \tfrac{1}{t} b \mathbf{1}_m^\top]$;
2: $\hat{w}$ = NearestPointSolver($P$);
3: $w = t \cdot (\hat{w}(1{:}m) - \hat{w}(m{+}1{:}2m)) / \mathrm{sum}(\hat{w})$;

Fig. 1. Geometric interpretation of the reduction process.
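To make the reduction concrete, the following is a minimal NumPy sketch of Algorithm 1, assuming a dense design matrix. The helper names (`standardize`, `build_P`, `recover_w`) are ours, not the paper's; the weights ŵ consumed by `recover_w` would come from the nearest point solver of Section 3.2.

```python
import numpy as np

def standardize(A, b):
    """Center b and each column of A, and scale columns to unit norm, as in Eq. (1)."""
    b = b - b.mean()
    A = A - A.mean(axis=0)
    norms = np.linalg.norm(A, axis=0)
    norms[norms == 0] = 1.0            # guard against constant (all-zero) features
    return A / norms, b

def build_P(A, b, t):
    """Step 1 of Algorithm 1: P = [A - (1/t) b 1_m^T, -A - (1/t) b 1_m^T]."""
    shift = np.outer(b, np.ones(A.shape[1])) / t
    return np.hstack([A - shift, -A - shift])

def recover_w(w_hat, t):
    """Step 3 of Algorithm 1: map the simplex weights back to Lasso coefficients."""
    m = w_hat.shape[0] // 2
    return t * (w_hat[:m] - w_hat[m:]) / w_hat.sum()
```

Since the reduction only builds P and maps weights back, any nearest-point-in-polytope routine can be plugged in between `build_P` and `recover_w`.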


3.2. Nearest point in a polytope

To better understand the optimization in our algorithm, we first introduce some definitions. Let $P = \{p_1, p_2, \dots, p_{2m}\}$ be a finite point set in $\mathbb{R}^n$. Then

$$\mathcal{A}(P) = \{x : x = Pw,\ \mathbf{1}_{2m}^\top w = 1\} \tag{9}$$

is the affine hull of $P$, and

$$\mathcal{C}(P) = \{x : x = Pw,\ \mathbf{1}_{2m}^\top w = 1,\ w \ge 0\} \tag{10}$$

is the convex hull of $P$. A point set $Q = \{q_1, q_2, \dots, q_k\}$ is called affinely independent if no point in $Q$ belongs to the affine hull of the remaining points. An affinely independent subset $Q$ of $P$ is called a corral [14] if the minimal norm point of $\mathcal{C}(Q)$ lies in its relative interior.

The reduction in the previous section enables us to derive an efficient algorithm for solving the Lasso using Wolfe's method. We summarize it in Algorithm 2,¹ which has the following steps.

Step 1. By the reduction of the Lasso we obtain $2m$ points $p_j$. Choose the point of minimal norm, denote it $x$, and let $Q = \{x\}$.
Step 2. If $x = 0$ or the hyperplane $H(x) = \{\hat{x} : x^\top \hat{x} = \|x\|^2\}$ separates $P$ from the origin, stop; otherwise, find the point with minimal inner product $x^\top p_j$ and add it to $Q$.
Step 3. Find the minimal norm point $y$ in the affine hull $\mathcal{A}(Q)$. If $y$ is also in the relative interior of the convex hull $\mathcal{C}(Q)$, set $x = y$ and return to Step 2; otherwise go to Step 4.
Step 4. Find the point $z$ nearest to $y$ in the intersection of the convex hull $\mathcal{C}(Q)$ and the line segment $xy$, set $x = z$, delete from $Q$ the points not on the face of $\mathcal{C}(Q)$ on which $z$ lies, and go to Step 3.

Algorithm 2. Pseudo-code of the nearest point solver.
1: $J_0 = \arg\min\{\|p_j\|^2;\ j = 1, \dots, 2m\}$, $S = \{J_0\}$ and $\hat{w}_S = [1]$;
2: loop
3:   $x = P_S \hat{w}_S$;
4:   $J = \arg\min\{x^\top p_j;\ j = 1, \dots, 2m\}$;
5:   if $x^\top p_J > x^\top x - \delta_1 \max\{\|p_j\|^2;\ j \in S \cup \{J\}\}$ then
6:     return $\hat{w}$ and break;
7:   end if
8:   $S = S \cup \{J\}$;
9:   loop
10:    Solve Eq. (13) for $v$;
11:    if $\forall i,\ v_i > \delta_2$ then
12:      $\hat{w}_S = v$ and break;
13:    else
14:      $D = \{i \mid w_i - v_i > \delta_3,\ i \in S\}$;
15:      $\theta = \min(1, \min\{w_i/(w_i - v_i);\ i \in D\})$;
16:      $\hat{w}_S = (1-\theta)\hat{w}_S + \theta v$;
17:    end if
18:  end loop
19: end loop

¹ $S$ is the index set of selected points $p_j$, a subset of the indices $\{1, 2, \dots, 2m\}$. $\delta_1$, $\delta_2$, $\delta_3$ are set to small numbers to control numerical errors.

Fig. 2 and Table 1 illustrate the algorithm on a simple example, giving the current $x$, $y$ and $Q$ at the end of each step.

Fig. 2. Illustration of how Algorithm 2 finds the nearest point in a polytope.

Table 1. Steps of Algorithm 2 in a simple example.

Step | x       | y | Q
1    | p_{m+1} |   | p_{m+1}
2    | p_{m+1} |   | p_{m+1}, p_4
3    | r       | r | p_{m+1}, p_4
2    | r       |   | p_{m+1}, p_4, p_{m+2}
3    | r       | o | p_{m+1}, p_4, p_{m+2}
4    | s       |   | p_4, p_{m+2}
3    | u       | u | p_4, p_{m+2}
2    |         |   | Stop

Minimal norm point in $\mathcal{A}(Q)$. The steps of Algorithm 2 are trivial to implement except for finding the minimal norm point $y$ in the affine hull $\mathcal{A}(Q)$ at Step 3. We first prove that $Q$ is affinely independent in Lemma 1, and then show how to find the minimal norm point in $\mathcal{A}(Q)$ when $Q$ is affinely independent.

Lemma 1 (Convergence [14]). The selected subset $Q$ in Algorithm 2 is always affinely independent, and the algorithm terminates in a finite number of steps.

Proof. First, we show that $Q$ is always affinely independent. It is trivial that $Q$ remains affinely independent after deleting a point. When a point $p_j$ is added to $Q$, $p_j \notin H(x)$, since $p_j$ lies on the near side of $H(x)$. Since $x$ is of minimal norm in $\mathcal{A}(Q)$, the line $ox$ is perpendicular to $\mathcal{A}(Q)$ and hence $\mathcal{A}(Q) \subseteq H(x)$. Therefore $p_j \notin \mathcal{A}(Q)$, and $Q \cup \{p_j\}$ is affinely independent.

Second, the total number of inner cycles cannot exceed the number of outer cycles. Each inner cycle removes at least one point from $Q$, each outer cycle adds one point to $Q$, and the final set $Q$ contains at least one point, so the total number of inner cycles cannot exceed the number of outer cycles.

Third, the number of outer cycles is finite. In Step 3, if $y$ is in the relative interior of $\mathcal{C}(Q)$, then $x$ is replaced by $y$ and the norm of $x$ decreases. Otherwise, $x$ is replaced by a point on the segment $xy$, so the norm of $x$ also decreases. Each corral $\mathcal{C}(Q)$ has a unique solution corresponding to $x$, which means each corral cannot enter the algorithm twice; since the number of corrals is finite, the algorithm stops in a finite number of steps. □
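The following Python sketch (our illustration, not the authors' code) mirrors the outer/inner loop structure of Algorithm 2, with the affine minimal-norm subproblem solved via the small bordered linear system derived in Eqs. (11)–(13) below. The explicit removal of vertices whose weight is driven to zero, and the guards for duplicate indices and an empty index set D, are numerical details left implicit in the pseudo-code.

```python
import numpy as np

def _affine_min_norm(Q):
    """Minimal norm point in the affine hull of the columns of Q.

    Solves the bordered KKT system of Eq. (13):
        [ 0    1_k^T ] [lambda]   [1]
        [ 1_k  Q^T Q ] [  v   ] = [0]
    and returns the affine weights v (which sum to one).
    """
    k = Q.shape[1]
    M = np.zeros((k + 1, k + 1))
    M[0, 1:] = 1.0
    M[1:, 0] = 1.0
    M[1:, 1:] = Q.T @ Q
    rhs = np.zeros(k + 1)
    rhs[0] = 1.0
    return np.linalg.solve(M, rhs)[1:]

def nearest_point_solver(P, d1=1e-10, d2=1e-12, d3=1e-12, max_iter=10000):
    """Wolfe-style nearest point to the origin in the convex hull of the columns of P."""
    n, p = P.shape
    sq_norms = np.sum(P ** 2, axis=0)
    S = [int(np.argmin(sq_norms))]            # line 1: start from the shortest vertex
    w_S = np.array([1.0])
    for _ in range(max_iter):                 # outer loop (finite by Lemma 1; cap for safety)
        x = P[:, S] @ w_S                     # line 3: current candidate point
        gaps = P.T @ x                        # line 4: all inner products x^T p_j
        J = int(np.argmin(gaps))
        # line 5: stop if no vertex lies significantly below the hyperplane H(x)
        if J in S or gaps[J] > x @ x - d1 * np.max(sq_norms[S + [J]]):
            break
        S.append(J)                           # line 8
        w_S = np.append(w_S, 0.0)
        while True:                           # inner loop, lines 9-18
            v = _affine_min_norm(P[:, S])     # line 10
            if np.all(v > d2):                # line 11: v is in the relative interior of C(Q)
                w_S = v
                break
            D = (w_S - v) > d3                # line 14
            if not np.any(D):                 # numerical guard: nothing blocks, accept clipped v
                w_S = np.maximum(v, 0.0)
                w_S /= w_S.sum()
                break
            theta = min(1.0, np.min(w_S[D] / (w_S[D] - v[D])))   # line 15
            w_S = (1.0 - theta) * w_S + theta * v                # line 16
            keep = w_S > d3                   # drop vertices whose weight reached zero (Step 4)
            S = [s for s, k_ in zip(S, keep) if k_]
            w_S = w_S[keep]
            w_S /= w_S.sum()                  # renormalize for numerical safety
    w_hat = np.zeros(p)
    w_hat[S] = w_S
    return w_hat
```

Because the columns indexed by S stay affinely independent (Lemma 1), the bordered system solved in `_affine_min_norm` is nonsingular at every inner iteration.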


Since $Q$ in Algorithm 2 is always affinely independent, we can formulate the problem of finding the minimal norm point in $\mathcal{A}(Q)$ as follows:

$$\min_{v \in \mathbb{R}^k} \ \|Qv\|_2^2 \quad \text{s.t.} \quad \sum_{i=1}^k v_i = 1. \tag{11}$$

Differentiating the Lagrangian function $\|Qv\|_2^2 + 2\lambda(\mathbf{1}_k^\top v - 1)$, we obtain the necessary conditions

$$\mathbf{1}_k^\top v = 1, \qquad \mathbf{1}_k \lambda + Q^\top Q v = 0, \tag{12}$$

which can be written as

$$\begin{bmatrix} 0 & \mathbf{1}_k^\top \\ \mathbf{1}_k & Q^\top Q \end{bmatrix} \begin{bmatrix} \lambda \\ v \end{bmatrix} = \begin{bmatrix} 1 \\ 0 \end{bmatrix}. \tag{13}$$

Affine independence of a set $Q$ of $k$ points is equivalent to the property that the $(n+1) \times k$ matrix

$$\begin{bmatrix} \mathbf{1}_k^\top \\ Q \end{bmatrix} \tag{14}$$

has rank $k$, as well as to the property that the symmetric matrix of order $k+1$

$$\begin{bmatrix} 0 & \mathbf{1}_k^\top \\ \mathbf{1}_k & Q^\top Q \end{bmatrix} \tag{15}$$

is nonsingular. By the nonsingularity of the matrix in (15), we can easily obtain the unique solution $v$ of (13).

3.3. Time complexity

Given data samples $\{(a^{(i)}, b_i)\}_{i=1}^n \in \mathbb{R}^m \times \mathbb{R}$, assume that $k$ features are selected. The construction of the matrix $P$ requires only $O(nm)$ operations, and the majority of the running time is, in all cases, spent in the outer cycles. At each loop, finding the point $p_j$ with minimal inner product with $x$ takes $O(nm)$ operations, which is the most time-consuming part of Algorithm 2. Finding the nearest point $v$ in the affinely independent set $\{p_i,\ i \in S\}$ requires at most $O(k^2)$ operations. If we stop the algorithm after $K$ steps, it therefore requires $O(K(nm + k^2))$ operations. It is worth mentioning that the computation of $x^\top p_j$, $j = 1, \dots, 2m$, can be carried out in parallel, which makes the algorithm even faster.
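As a small illustration of that last point, the per-iteration inner products reduce to a single matrix–vector product (and are therefore parallelized by any BLAS). The shapes below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 200, 5000
P = rng.standard_normal((n, 2 * m))      # columns p_1, ..., p_2m
x = rng.standard_normal(n)               # current iterate

gaps = P.T @ x                           # all 2m inner products x^T p_j in one call
J = int(np.argmin(gaps))                 # vertex entering the working set
```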

4. Experimental results

In this section, we conduct extensive experiments to evaluate GeoLasso on both synthetic and real world data sets. We first provide a brief description of the experimental setup and the data sets, and then evaluate the performance of the different algorithms.

Fig. 5. Training time comparison of various algorithms (panels: GLI-85 [n=85, p=22,283], SMK-CAN-187 [n=187, p=19,993], GLA-BRA-180 [n=180, p=49,151], PEMS [n=440, p=138,672], Scene15 [n=544, p=71,963], Dorothea [n=800, p=88,119]; compared solvers: Shotgun, L1_LS, SLEP, LARS, CGD). Each marker compares an algorithm with GeoLasso on one (out of six) data sets and one parameter setting. The X and Y axes denote the running time of GeoLasso and of that particular algorithm on the same problem, respectively.

4.1. Data sets and experimental setting

Data sets: We use the prostate cancer data [11] to validate the correctness of our reduction method. The data set has eight clinical features (e.g. log(cancer volume), log(prostate weight)) and the response is the logarithm of the prostate-specific antigen (lpsa). We also evaluate GeoLasso on several large scale data sets (Table 2) for speed comparison: GLI-85, a data set that screens a large number of diffuse infiltrating gliomas through transcriptional profiling; SMK-CAN-187, a gene expression data set from smokers with and without lung cancer; GLA-BRA-180, a data set concerning the analysis of gliomas of different grades; PEMS [1], a data set describing the occupancy rate, between 0 and 1, of different car lanes of San Francisco Bay Area freeways; Scene15, a scene recognition data set [8,16], of which we use the binary classes 6 and 7 for feature selection; and Dorothea, a sparse data set from the NIPS 2003 feature selection contest, whose task is to predict which compounds bind to Thrombin (we removed features with all-zero values across all inputs).

Table 2. Data sets used for speed comparison.

Data set      | #Instances | #Features
GLI-85        | 85         | 22,283
SMK-CAN-187   | 187        | 19,993
GLA-BRA-180   | 180        | 49,151
PEMS          | 440        | 138,672
Scene15       | 544        | 71,963
Dorothea      | 800        | 88,119

Experimental setting: We compare GeoLasso with several state-of-the-art Lasso implementations. LARS [4] is one of the most popular Lasso algorithms; here we use the Spasm package [10]. The Shotgun algorithm by Bradley et al. [3] parallelizes coordinate gradient descent. L1_LS is a MATLAB solver for the Lasso implemented by Kim et al. [6]. SLEP is a Lasso implementation by Liu et al. [9]. CGD is a coordinate gradient descent method coded in MATLAB [17]. All experiments were performed on a desktop with two 8-core Intel(R) Xeon(R) processors at 2.67 GHz and 96 GB of RAM.

4.2. Comparison with LARS

In this part we mainly compare with LARS, for two reasons. First, it is one of the most widely used Lasso solvers. Second, both our method and LARS need to invert a matrix, while the other algorithms do not.

Correctness: To validate the correctness of the proposed method, we compare the regularization paths of GeoLasso and LARS. We present a full regularization path in Fig. 3, obtained by increasing the budget $t$ from essentially zero to a value at which almost all features are selected. Each line in the figure corresponds to the value $w_i$ associated with a feature $i$ (= 1, ..., 8) as a function of the L1 budget $t$. The graph shows that the two algorithms produce exactly matching regularization paths as the budget $t$ increases, verifying the correctness of GeoLasso.

Fig. 3. The regularization paths of LARS (left) and GeoLasso (right) on the prostate data set. Each line corresponds to the value of $w_i$ as a function of the L1 budget $t$. The two algorithms match exactly for all values of $t$.

Numerical stability: The original LARS paper [4] assumes that the selected features form a linearly independent set. But as a recent paper [13] points out, this assumption may not hold in general; especially when the number of features is much larger than the number of data samples, we cannot guarantee that the Gram matrix of the selected features in LARS is nonsingular. Taking the data sets Scene15 and Dorothea as examples, the condition number of the matrix that LARS needs to invert becomes extremely large, as illustrated in Fig. 4. Our method, in contrast, guarantees that the points in each corral are affinely independent, which makes it more numerically stable.

Fig. 4. The condition number of LARS and GeoLasso on the data sets Scene15 (left) and Dorothea (right). The number of points for LARS is smaller than for GeoLasso because LARS becomes singular and cannot find a positive direction towards the next event.
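To reproduce the kind of ill-conditioning discussed above on a toy example, the snippet below (our illustration; the data and the active set are invented) reports the condition number of the Gram matrix of a selected feature subset, which is the matrix LARS effectively has to invert.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 100, 500
A = rng.standard_normal((n, m))
A[:, 1] = A[:, 0] + 1e-8 * rng.standard_normal(n)   # two nearly identical features

S = [0, 1, 2]                                        # hypothetical active set containing both
gram = A[:, S].T @ A[:, S]
print(f"cond(A_S^T A_S) = {np.linalg.cond(gram):.3e}")   # blows up as the columns coincide
```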


4.3. Comparison with other algorithms

In this part we compare GeoLasso with other state-of-the-art Lasso implementations. Fig. 5 depicts the training time comparison between GeoLasso and the five baseline algorithms on the six data sets. Each marker corresponds to a comparison of one algorithm with GeoLasso under a particular L1 budget: its y-coordinate is the training time required by that algorithm, and its x-coordinate is the training time required by GeoLasso with the same L1-norm budget. All markers above the diagonal correspond to runs where GeoLasso is faster, and all markers below the diagonal correspond to runs where GeoLasso is slower. We observe that across all six data sets GeoLasso is competitive with the other Lasso solvers.

5. Conclusion

For the Lasso with its L1 norm constraint, we first eliminate the L1 norm constraint by reformulating the problem as a nearest point problem. Then, from a geometric perspective, we introduce a specialized algorithm to solve this problem efficiently. The proposed approach demonstrates a practical reduction of the Lasso and may enhance the geometrical understanding of the Lasso. Computational results on several large scale data sets show that the proposed algorithm has better numerical stability than LARS and is competitive with other existing approaches.

Acknowledgements

This work was supported in part by the National Natural Science Foundation of China under Grants 41427806 and 61273233, the National Key Technology R&D Program under Grant 2012BAF01B03, the Research Fund for the Doctoral Program of Higher Education under Grant 20130002130010, the Project of China Ocean Association under Grant DY125-25-02, and the Tsinghua University Initiative Scientific Research Program under Grant 20131089300.

References

[1] K. Bache, M. Lichman, UCI Machine Learning Repository, 2013.
[2] S. Boyd, L. Vandenberghe, Convex Optimization, Cambridge University Press, New York, 2004.
[3] J. Bradley, A. Kyrola, D. Bickson, C. Guestrin, Parallel coordinate descent for l1-regularized loss minimization, in: ICML, 2011, pp. 321–328.
[4] B. Efron, T. Hastie, I. Johnstone, R. Tibshirani, Least angle regression, Ann. Stat. 32 (2) (2004) 407–499.
[5] M. Jaggi, An equivalence between the Lasso and support vector machines, arXiv:1303.1152, 2013.
[6] S. Kim, K. Koh, M. Lustig, S. Boyd, D. Gorinevsky, An interior-point method for large-scale l1-regularized least squares, IEEE J. Sel. Top. Signal Process. 1 (4) (2007) 606–617.
[7] M. Kusner, W. Chen, Q. Zhou, Z. Xu, K.Q. Weinberger, Y. Chen, Feature-cost sensitive learning with submodular trees of classifiers, in: Twenty-Eighth AAAI Conference on Artificial Intelligence, 2014.
[8] S. Lazebnik, C. Schmid, J. Ponce, Beyond bags of features: spatial pyramid matching for recognizing natural scene categories, in: CVPR, vol. 2, 2006, pp. 2169–2178.
[9] J. Liu, S. Ji, J. Ye, SLEP: Sparse Learning with Efficient Projections, Arizona State University, Tempe, AZ, 2009.
[10] M. Schmidt, Least squares optimization with l1-norm regularization, CS542B Project Report, 2005.
[11] R. Tibshirani, Regression shrinkage and selection via the Lasso, J. R. Stat. Soc. Ser. B (Methodol.) (1996) 267–288.
[12] R. Tibshirani, Regression shrinkage and selection via the Lasso: a retrospective, J. R. Stat. Soc.: Ser. B (Stat. Methodol.) 73 (3) (2011) 273–282.
[13] R.J. Tibshirani, The Lasso problem and uniqueness, Electron. J. Stat. 7 (2013) 1456–1490.
[14] P. Wolfe, Finding the nearest point in a polytope, Math. Progr. 11 (1) (1976) 128–149.
[15] Z. Xu, G. Huang, K.Q. Weinberger, A. Zheng, Gradient boosted feature selection, in: Proceedings of the 20th ACM SIGKDD, ACM, 2014, pp. 522–531.
[16] Z. Xu, K.Q. Weinberger, O. Chapelle, The greedy miser: learning under test-time budgets, in: ICML, 2012, pp. 1175–1182.
[17] S. Yun, K. Toh, A coordinate gradient descent method for l1-regularized convex minimization, Comput. Optim. Appl. 48 (2) (2011) 273–307.
[18] Q. Zhou, W. Chen, S. Song, J. Gardner, K.Q. Weinberger, Y. Chen, A reduction of the elastic net to support vector machines with an application to GPU computing, in: Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
[19] Q. Zhou, S. Song, C. Wu, G. Huang, Kernelized LARS–LASSO for constructing radial basis function neural networks, Neural Comput. Appl. 23 (2013) 1969–1976.

Quan Zhou received the B.S. degree from the School of Automation Science and Electrical Engineering, Beihang University, Beijing, China, and is currently pursuing the Ph.D. degree with the Department of Automation, Tsinghua University, Beijing. He was a Visiting Research Scholar with the Department of Computer Science and Engineering, Washington University in St. Louis, St. Louis, MO, USA, in 2014. His current research interests include machine learning and statistical learning, especially sparse learning.

Shiji Song received the Ph.D. degree from the Department of Mathematics, Harbin Institute of Technology, Harbin, China, in 1996. He is currently a Professor with the Department of Automation, Tsinghua University, Beijing, China. His current research interests include system modeling, control and optimization, computational intelligence, and pattern recognition.

Gao Huang received the B.S. degree from the School of Automation Science and Electrical Engineering, Beihang University, Beijing, China, and is currently pursuing the Ph.D. degree with the Department of Automation, Tsinghua University, Beijing. He was a Visiting Research Scholar with the Department of Computer Science and Engineering, Washington University in St. Louis, St. Louis, MO, USA, in 2013. His current research interests include machine learning and pattern recognition, especially in semi-supervised learning and robust learning.

Cheng Wu received the B.S. and M.S. degrees in electrical engineering from Tsinghua University, Beijing, China. Since 1967, he has been with Tsinghua University, where he is currently a Professor with the Department of Automation. His current research interests include system integration, modeling, scheduling, and optimization of complex industrial systems. Mr. Wu is a member of the Chinese Academy of Engineering.
