Graph Structured Sparse Subset Selection


Hyungrok Do, Myun-Seok Cheon, Seoung Bum Kim

PII: S0020-0255(19)31212-5
DOI: https://doi.org/10.1016/j.ins.2019.12.086
Reference: INS 15129
To appear in: Information Sciences
Received 7 April 2019; Revised 20 December 2019; Accepted 30 December 2019

Please cite this article as: Hyungrok Do, Myun-Seok Cheon, Seoung Bum Kim, Graph Structured Sparse Subset Selection, Information Sciences (2019), doi: https://doi.org/10.1016/j.ins.2019.12.086

This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain. © 2019 Published by Elsevier Inc.

Highlights

• We propose a variable selection method incorporating a graph structure.
• The proposed method is formulated as a discrete optimization problem.
• The proposed method selects adjacent variables in a graph simultaneously.
• Experimental studies demonstrate the superiority of the proposed algorithm.


Graph Structured Sparse Subset Selection

Hyungrok Do^a, Myun-Seok Cheon^b, Seoung Bum Kim^{a,∗}

^a School of Industrial Management Engineering, Korea University, 145 Anam-ro, Seongbuk-gu, Seoul 02841, Republic of Korea
^b ExxonMobil Research and Engineering Company, 1545 Route 22 East, Clinton Township, Annandale, New Jersey 08801, USA

Abstract

We propose a new method for variable subset selection and regression coefficient estimation in linear regression models that incorporates a graph structure of the predictor variables. The proposed method is based on a cardinality constraint that controls the number of selected variables and a graph structured subset constraint that encourages predictor variables adjacent in the graph to be selected or eliminated from the model together. Moreover, we develop an efficient discrete projected gradient descent method to handle the NP-hardness of the problem originating from the discrete constraints. Numerical experiments on simulated and real-world data are conducted to demonstrate the usefulness and applicability of the proposed method by comparing it with existing graph regularization methods in terms of predictive accuracy and variable selection performance. The results confirm that the proposed method outperforms the existing methods.

Keywords: Variable subset selection, Graph structure, Constrained regression, Discrete optimization

∗Corresponding author. Email address: [email protected] (Seoung Bum Kim)

Preprint submitted to Information Sciences, January 2, 2020

1. Introduction

This study considers simultaneous variable selection and regression coefficient estimation in high-dimensional regression problems. Penalized regression methods, which use regularization in the model fitting, have been widely used for these problems. The most prominent example is

the least absolute shrinkage and selection operator (Lasso) [1], which uses the $\ell_1$ penalty to encourage sparsity. Besides the Lasso, many extensions have been proposed based on various sparsity-inducing penalties. Examples include the smoothly clipped absolute deviation (SCAD) penalty [2], the minimax concave penalty (MCP) [3], the truncated Lasso penalty (TLP) [4], and the seamless-$\ell_0$ penalty (SELO) [5]. These methods encourage sparsity by shrinking the regression coefficients associated with less informative variables toward zero.

Various penalized methods that go beyond merely encouraging sparsity have also been studied. For example, in the presence of highly correlated variables, the Lasso tends to select only one of the correlated variables, resulting in unsatisfactory variable selection and prediction performance [6]. To address this issue, Zou and Hastie [6] introduced the elastic net (Enet), which uses a combination of the $\ell_1$ and $\ell_2$ penalties. The Enet encourages a grouping effect in which strongly correlated variables tend to be in or out of the model together. Several methods, such as the octagonal shrinkage and clustering algorithm for regression (OSCAR) [7], the hexagonal operator for regression with shrinkage and equality selection (HORSES) [8], and Mnet [9], were proposed to address the grouping effect. Likewise, several methods that directly utilize correlation-based penalties have been presented to deal with correlated variables [10, 11, 12, 13].

A further approach concerns the incorporation of structural information of the predictor variables into the model. Yuan and Lin [14] proposed the group Lasso, which uses predefined variable group information to select the variables; they imposed the $\ell_{1,2}$ norm penalty on each group of variables for groupwise variable selection. A similar groupwise variable selection based on the $\ell_{1,\infty}$ norm penalty was also studied [15]. Vogt and Roth [16] generalized the group Lasso to the $\ell_{1,p}$ norm with $1 \le p \le \infty$. Moreover, Jacob et al. [17] extended the group Lasso to the case of overlapping groups, and it was further extended to hierarchical structures of groups in Zhao et al. [18], Jenatton et al. [19], and Liu and Ye [20].

As a further generalization, graph structures have been incorporated into models in the literature. Li and Li [21] proposed a graph-constrained estimation procedure (Grace) that uses a combination of the $\ell_1$ norm penalty on the regression coefficients and a graph Laplacian quadratic penalty. The graph Laplacian quadratic penalty encourages the neighboring variables in a graph to have the same regression coefficients. Li and Li [22] extended their work to adaptive Grace (aGrace), which forces the regression coefficients of neighboring variables to have the same absolute values. Kim et al. [23]

proposed the graph fused Lasso (GFLasso), which uses a penalty based on the pairwise absolute differences of regression coefficients adjacent in the graph. GFLasso is a generalization of the fused Lasso [24] to the case of a graph structure. Pan et al. [25] proposed a grouped variable selection method over a graph. Yang et al. [26] introduced several nonconvex penalties to handle a graph structure in a similar way. These methods were designed to encourage smoothness over the graph. Shen et al. [27] proposed a similar nonconvex penalty for the same purpose. On the other hand, Kim et al. [28] considered a method that encourages the regression coefficients of neighboring variables in the graph to be more likely zero or nonzero simultaneously, rather than smoothing the regression coefficients over the graph.

Most existing penalized regression methods are based on convex or continuous optimization because they are computationally tractable. Recently, several variable subset selection methods have been proposed based on discrete optimization. These methods consider the $\ell_0$-regularized regression problem, which is also known as the best subset selection problem. $\ell_0$-regularization selects variables more sparsely than the Lasso and its variants, and has less coefficient estimation bias. Bertsimas et al. [29] proposed a mixed-integer quadratic optimization (MIQO) approach to solve the best subset selection problem of choosing $k$ out of $p$ variables for linear regression. They also developed a discrete extension of the projected gradient descent algorithm to find a stationary point solution for the best subset selection problem. Moreover, several methods have been proposed to find the best subset with additional considerations [30, 31, 32, 33]. Mazumder and Radchenko [34] proposed a mixed-integer linear optimization approach to a discrete extension of the Dantzig selector. Hazimeh and Mazumder [35] proposed a coordinate descent and MIQO-based method to find a good local optimal solution for $\ell_0$-regularized linear regression. Although all of these methods yielded reliable results in their settings, none of them considered the structural information of the predictor variables. Incorporating structural information of the predictor variables can lead to better prediction performance and variable selection results.

In this study, we propose a discrete optimization-based method to select a subset of variables in a regression problem while simultaneously considering the structural information of the predictor variables represented by a graph. We consider a constrained regression problem with a combination of cardinality and graph structured subset constraints. The cardinality constraint is equivalent to the $\ell_0$-regularization that directly controls the number

of nonzero regression coefficients. On the other hand, the graph structured subset constraint encourages the adjacent variables in a given graph to be simultaneously selected or eliminated from the model. Both constraints concern only the variable selection and, unlike existing penalized methods, do not shrink the regression coefficients. Thus, our method does not suffer from unwanted bias due to shrinkage and provides a sparser model.

The main contributions of this study can be summarized as follows:

(1) We propose a graph structured sparse subset selection (Grass) method that minimizes the sum of squared errors subject to the cardinality and graph structured subset constraints. The cardinality constraint controls the maximum number of selected variables. The graph structured subset constraint forces the adjacent variables in the graph to be simultaneously selected or eliminated from the model. Both constraints involve the indicator function, which is not continuous. We use a mixed-integer quadratic optimization formulation to handle the discreteness. Unlike conventional continuous optimization-based approaches, our method provides highly sparse and accurate variable selection results.

(2) Motivated by recent algorithmic advances in convex and discrete optimization, we develop an efficient discrete projected gradient descent method to find a stationary point solution for the graph structured sparse subset selection problem. Although the proposed discrete projected gradient descent method does not guarantee that the global optimum is reached, it can solve the NP-hard graph structured sparse subset selection problem with acceptable computational time and accuracy.

(3) To demonstrate the usefulness and applicability of our method, we conduct numerical experiments on simulated and real-world data to compare our method with existing methods in terms of predictive accuracy and variable selection performance. The results confirm that our method outperforms the other methods.

The remainder of this paper is organized as follows. In Section 2, we review the existing variable subset selection methods based on mixed-integer optimization and the graph regularization methods that are closely related to our work. Section 3 describes the details of our method. Section 4 presents a simulation study to examine the performance of our method. Section 5 presents a case study to demonstrate the applicability of the proposed method. Finally, concluding remarks are given in Section 6.

Notation       Definition
X              Data matrix
y              Response variable vector
β              Regression coefficient vector
ε              Random noise vector
z              Variable selection indicator vector of β
G              Graph
V              Set of vertices of G
E              Set of edges of G
n              Number of observations
p              Number of predictor variables
q              Number of edges of G
i              Index of a predictor variable, i = 1, · · · , p
j              Index of a predictor variable, j = 1, · · · , p
βi             ith regression coefficient
zi             Binary variable indicating whether βi is nonzero
eij            Binary variable indicating whether zi and zj differ
k              Grass hyperparameter controlling the maximum number of nonzero variables
t              Grass hyperparameter controlling the maximum number of violated adjacency constraints
M              Big M: a large constant value
λ1             Hyperparameter for the sparsity-inducing penalty term
λ2             Hyperparameter for the graph regularization term
τ              Hyperparameter for the truncated Lasso penalty
P(·)           Penalty function
I(·)           Indicator function
Jτ(·)          Truncated Lasso penalty (TLP) function
P(G,k,t)(·)    Graph structured projection operator

Table 1: List of notations used in this study

2. Related Works

Before introducing the graph structured sparse subset selection, we present a brief overview of the variable selection methods based on mixed-integer optimization and describe the existing graph regularization methods. Table 1 shows the notations used in this study. Throughout the paper, we consider a linear regression model:

$$ y = X\beta + \varepsilon, \qquad (1) $$

with a response vector $y \in \mathbb{R}^{n \times 1}$, a data matrix $X = [x_1, \cdots, x_p] \in \mathbb{R}^{n \times p}$, regression coefficients $\beta \in \mathbb{R}^{p \times 1}$, and a random noise vector $\varepsilon \in \mathbb{R}^{n \times 1}$. We assume that the random noise is identically and independently distributed with a zero mean and constant variance. We also assume that the predictor variables are standardized to have a zero mean and unit variance, and the response variable has a zero mean.

2.1. Variable Selection via Mixed-Integer Optimization

In recent years, mixed-integer optimization (MIO) has been widely used to solve the best subset selection for regression or $\ell_0$-regularized regression problems. The mixed-integer quadratic optimization (MIQO) formulation of the best subset selection problem is given by:

$$
\begin{aligned}
\underset{\beta, z}{\text{minimize}}\quad & \frac{1}{2n}\|y - X\beta\|_2^2 \\
\text{subject to}\quad & -M z_i \le \beta_i \le M z_i, \quad i = 1, \cdots, p, \\
& \textstyle\sum_{i=1}^{p} z_i \le k, \\
& z_i \in \{0, 1\}, \quad i = 1, \cdots, p,
\end{aligned}
\qquad (2)
$$

where $\beta_i$ is a regression coefficient, $z_i$ is the associated variable selection indicator, and $M$ is an arbitrarily large constant satisfying $\|\beta^*\|_\infty \le M$, provided $\beta^*$ is a minimizer of the problem. By the first constraint of (2), if $z_i = 0$ then $\beta_i = 0$, implying that the variable is not selected. Conversely, if $z_i = 1$, then $\beta_i$ can take any value whose magnitude does not exceed $M$, implying that the variable is selected. Thus, the quantity $\sum_{i=1}^{p} z_i$ on the left-hand side of the second constraint of (2) is the number of nonzero variables in the model, and $k$ is a hyperparameter that determines the maximum number of variables to select.

This MIQO formulation can be solved by mathematical optimization solvers, such as Gurobi, CPLEX, or Xpress. Although the solvers can find a global optimal solution of the best subset selection problem of moderate size, they are still inadequate for addressing high-dimensional problems. Various techniques have been studied to overcome this computational limitation. Bertsimas et al. [29] proposed an efficient MIQO reformulation of the best subset selection problem, based on the fact that solvers are better equipped to handle mixed-integer linear optimization (MILO) problems than MIQO problems. Moreover, they introduced several valid inequalities to reduce the search space of the problem without changing the global optimum. They also developed a discrete extension of the projected gradient descent method to find a first order stationary solution of the problem and used it as an initial solution. Their work suggests that MIO methods for the best subset selection problem are tractable, and MIO-based approaches have since been extended. Mazumder and Radchenko [34] proposed the discrete Dantzig selector, which minimizes the number of nonzero regression coefficients subject to a budget on the maximal absolute correlation between the variables and residuals. They used an MILO approach to solve the discrete Dantzig selector. Moreover, they developed a discrete first order method to find a first order stationary solution that provides an upper bound for the original optimization problem. Hazimeh and Mazumder [35] introduced an efficient algorithm to find a good local optimal solution of the $\ell_0$-regularized regression problem. Their method is based on coordinate descent and local combinatorial optimization; MIQO is used to solve the local combinatorial optimization. Motivated by these works, we develop a discrete projected gradient descent method to solve the graph structured sparse subset selection problem.

2.2. Graph Regularization Methods

Most existing graph regularization methods have been developed in the penalized regression framework. Penalized regression methods solve the following optimization problem:

$$
\underset{\beta}{\text{minimize}}\quad \frac{1}{2n}\|y - X\beta\|_2^2 + \mathcal{P}(\beta),
\qquad (3)
$$

where $\mathcal{P}(\beta)$ is a penalty function. In addition to the penalized regression setting, consider an undirected and unweighted graph $G = (V, E)$, where $V = \{1, \cdots, p\}$ is the set of vertices corresponding to the predictor variables $X_1, \cdots, X_p$, respectively, and $E = \{(i, j) : i, j \in V\}$ is the set of edges indicating that the predictor variables $X_i$ and $X_j$ are adjacent in the graph. Table 2 summarizes the existing graph regularization methods. Note that $d_i$ is the degree of vertex $i$, $s_i$ is the sign of the Enet estimator, $\rho_{ij}$ is the sample correlation between $X_i$ and $X_j$, and $J_\tau(|\beta_i|) = \min\{|\beta_i|/\tau, 1\}$ is the truncated Lasso penalty (TLP) [4].

Li and Li [21] introduced the Grace method with a combination of the $\ell_1$ and graph Laplacian quadratic penalties. The $\ell_1$ penalty is for variable selection, while the graph Laplacian quadratic penalty encourages smoothness of the regression coefficients over the graph. Grace encourages (weighted) $\beta_i = \beta_j$ if $(i, j) \in E$, based on the assumption that neighboring variables in the graph have the same effect on the response variable.
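To make the graph Laplacian quadratic penalty concrete, the short Python sketch below evaluates a Grace-style penalty for a coefficient vector over an undirected edge list. It is an illustration only: the function name, toy graph, and hyperparameter values are our own assumptions, not code from the paper.

```python
import numpy as np

def grace_penalty(beta, edges, lam1, lam2):
    """lam1 * sum_i |beta_i| + lam2 * sum_{(i,j) in E} (beta_i/sqrt(d_i) - beta_j/sqrt(d_j))^2."""
    p = len(beta)
    deg = np.zeros(p)
    for i, j in edges:                         # vertex degrees in the undirected graph
        deg[i] += 1
        deg[j] += 1
    l1_term = np.sum(np.abs(beta))
    laplacian_term = sum(
        (beta[i] / np.sqrt(deg[i]) - beta[j] / np.sqrt(deg[j])) ** 2
        for i, j in edges
    )
    return lam1 * l1_term + lam2 * laplacian_term

# Toy star subgraph 0-(1,2,3): degree-scaled equal coefficients make the quadratic term vanish.
edges = [(0, 1), (0, 2), (0, 3)]
beta = np.array([np.sqrt(3.0), 1.0, 1.0, 1.0])   # beta_0/sqrt(3) == beta_j/sqrt(1)
print(grace_penalty(beta, edges, lam1=0.1, lam2=1.0))
```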

Method           Penalty function $\mathcal{P}(\beta)$

Grace [21]       $\lambda_1 \sum_{i=1}^{p} |\beta_i| + \lambda_2 \sum_{(i,j)\in E} \left( \frac{\beta_i}{\sqrt{d_i}} - \frac{\beta_j}{\sqrt{d_j}} \right)^{2}$
aGrace [22]      $\lambda_1 \sum_{i=1}^{p} |\beta_i| + \lambda_2 \sum_{(i,j)\in E} \left( \frac{s_i\beta_i}{\sqrt{d_i}} - \frac{s_j\beta_j}{\sqrt{d_j}} \right)^{2}$
GFLasso [23]     $\lambda_1 \sum_{i=1}^{p} |\beta_i| + \lambda_2 \sum_{(i,j)\in E} \left| \beta_i - \mathrm{sign}(\rho_{ij})\,\beta_j \right|$
Goscar [26]      $\lambda_1 \sum_{i=1}^{p} |\beta_i| + \lambda_2 \sum_{(i,j)\in E} \max\{|\beta_i|, |\beta_j|\}$
ncFGS [26]       $\lambda_1 \sum_{i=1}^{p} |\beta_i| + \lambda_2 \sum_{(i,j)\in E} \big| |\beta_i| - |\beta_j| \big|$
ncTFGS [26, 27]  $\lambda_1 \sum_{i=1}^{p} J_\tau(|\beta_i|) + \lambda_2 \sum_{(i,j)\in E} J_\tau\big( \big||\beta_i| - |\beta_j|\big| \big)$
TTLP [28]        $\lambda_1 \sum_{i=1}^{p} J_\tau(|\beta_i|) + \lambda_2 \sum_{(i,j)\in E} \left| J_\tau\!\left( \frac{|\beta_i|}{\sqrt{d_i}} \right) - J_\tau\!\left( \frac{|\beta_j|}{\sqrt{d_j}} \right) \right|$
LTLP [28]        $\lambda_1 \sum_{i=1}^{p} |\beta_i| + \lambda_2 \sum_{(i,j)\in E} \left| J_\tau\!\left( \frac{|\beta_i|}{\sqrt{d_i}} \right) - J_\tau\!\left( \frac{|\beta_j|}{\sqrt{d_j}} \right) \right|$

Table 2: Penalty functions of the existing graph regularization methods

In general, it is possible that neighboring variables in the graph affect the response variable equally in magnitude while their directions of effect are opposite, or that the variables are negatively correlated. In this case, it is more reasonable to assume (weighted) $|\beta_i| = |\beta_j|$ if $(i, j) \in E$. Kim et al. [23] proposed the graph fused Lasso (GFLasso) with the $\ell_1$ and graph fusion penalties, and Li and Li [22] introduced an adaptive version of Grace (aGrace). GFLasso handles negatively correlated predictor variables adjacent in the graph through the sample correlation, and aGrace utilizes the signs of the Enet estimator for the same purpose. However, both methods may not work well for high-dimensional data because it is unknown which regression coefficients are truly zero.

To overcome this limitation, Yang et al. [26] proposed the graph OSCAR (Goscar) together with the nonconvex feature grouping and selection penalty (ncFGS) and the nonconvex truncated feature grouping and selection penalty (ncTFGS). Shen et al. [27] introduced a graph regularization method similar to ncTFGS. These methods directly encourage $|\beta_i| = |\beta_j|$ if $(i, j) \in E$ and do not share the difficulties of GFLasso and aGrace.

Although the methods mentioned above are useful, the assumptions on the smoothness of the regression coefficients or of their absolute values over a graph may be overly strict in some cases. Kim et al. [28] considered a more general assumption that the regression coefficients of two neighboring variables in a graph are simultaneously zero or nonzero, while their values or signs may or may not be equal; that is, $I(\beta_i \neq 0) = I(\beta_j \neq 0)$ if $(i, j) \in E$, where $I(\cdot)$ is an indicator function. However, their method does not handle the indicator functions directly to encourage this relationship. They proposed a penalty with a TLP for variable selection and a TLP-based penalty for the grouping of indicators (TTLP), replacing the indicator functions with TLPs; the TLP is a computational surrogate of an indicator function in that $J_\tau(|\beta|) \to I(\beta \neq 0)$ as $\tau \to 0^{+}$. Moreover, they also proposed a Lasso-based penalty for variable selection combined with a TLP-based penalty for the grouping of indicators (LTLP). Although their assumption for incorporating the graph structure is reasonable, their methods rely on a continuous approximation of the problem to be solved. In the next section, we propose a new formulation that considers $I(\beta_i \neq 0) = I(\beta_j \neq 0)$ if $(i, j) \in E$ and introduce an efficient method based on discrete optimization that does not involve continuous approximations.

3. Proposed Method

3.1. Graph Structured Sparse Subset Selection (Grass)

Under the linear regression setting with a graph on the predictors, we need to identify the best subset of variables that minimizes the mean squared error while simultaneously encouraging the relationship: if $(i, j) \in E$ then $I(\beta_i \neq 0) = I(\beta_j \neq 0)$. We propose the graph structured sparse subset selection problem, a constrained linear regression that minimizes the mean squared error subject to two constraints:

$$
\begin{aligned}
\underset{\beta}{\text{minimize}}\quad & \frac{1}{2n}\|y - X\beta\|_2^2 \\
\text{subject to}\quad & \textstyle\sum_{i=1}^{p} I(\beta_i \neq 0) \le k, \\
& \textstyle\sum_{(i,j)\in E} \left| I(\beta_i \neq 0) - I(\beta_j \neq 0) \right| \le t,
\end{aligned}
\qquad (4)
$$

where $I(\cdot)$ is an indicator function. The first constraint of (4) is the cardinality constraint, which limits the number of nonzero regression coefficients to at most $k$, a hyperparameter. The second constraint is the graph structured subset constraint, which encourages the simultaneous selection or elimination of neighboring predictor variables. Rather than forcing $\beta$ to satisfy $I(\beta_i \neq 0) = I(\beta_j \neq 0)$ for all edges, we allow several edges to be violated. The maximum number of edges allowed to be violated is controlled by $t$, which is also a hyperparameter. If $t$ is greater than the number of edges, the problem is equivalent to the traditional best subset selection problem. Note that Problem (4) is NP-hard because it is not easier than Problem (2).

Our method does not use shrinkage-based terms to select variables or to incorporate the graph structure of the predictor variables. Therefore, similar to the best subset selection, our method does not suffer from estimation bias and thus can select a sparser subset of variables than the Lasso and its variants, which are based on continuous optimization. Moreover, the proposed method can directly control the levels of sparsity and graph structured subset violation by adjusting the hyperparameters $k$ and $t$, respectively. This is advantageous because we can intuitively determine how strictly or loosely the given graph structure is incorporated.

3.2. Mixed-Integer Quadratic Optimization Formulation

A reformulation of the graph structured sparse subset selection problem as an MIQO is introduced as follows:

$$
\begin{aligned}
\underset{\beta, z, e}{\text{minimize}}\quad & \frac{1}{2n}\|y - X\beta\|_2^2 \\
\text{subject to}\quad & -M z_i \le \beta_i \le M z_i, \quad i = 1, \cdots, p, \\
& \textstyle\sum_{i=1}^{p} z_i \le k, \\
& -e_{ij} \le z_i - z_j \le e_{ij}, \quad (i,j) \in E, \\
& \textstyle\sum_{(i,j)\in E} e_{ij} \le t, \\
& \beta_i \in \mathbb{R}, \quad i = 1, \cdots, p, \\
& z_i \in \{0, 1\}, \quad i = 1, \cdots, p, \\
& e_{ij} \in \{0, 1\}, \quad (i,j) \in E,
\end{aligned}
\qquad (5)
$$

where $M$ is an arbitrarily large constant satisfying $\|\beta^*\|_\infty \le M$, where $\beta^*$ is a minimizer of the problem, $\beta_i$ is a regression coefficient, and $z_i$ is the associated variable selection indicator. By the first constraint of (5), if $z_i = 0$ then $\beta_i = 0$. Conversely, if $z_i = 1$, then $\beta_i$ can take any value whose magnitude does not exceed $M$. Thus, the quantity $\sum_{i=1}^{p} z_i$ on the left-hand side of the second constraint of (5) is the number of nonzero regression coefficients in the model, and $k$ is a hyperparameter that determines the maximum number of variables to select. On the other hand, $e_{ij}$ is an indicator of an edge violation. The third constraint of (5) implies that if $I(\beta_i \neq 0) \neq I(\beta_j \neq 0)$, or equivalently $z_i \neq z_j$, then $e_{ij} = 1$, indicating that the associated edge $(i, j)$ is violated. The quantity $\sum_{(i,j)\in E} e_{ij}$ on the left-hand side of the fourth constraint of (5) counts the number of violated edges, and thus $t$ is the maximum number of violated edges allowed.

The MIQO formulation given above can be solved by general MIO solvers, including Gurobi, CPLEX, or Xpress. However, these solvers take more than a day even for small-size problems with $p$ in the hundreds. To solve the problem more efficiently, we develop a new discrete projected gradient descent method.

3.3. Accelerated Discrete Projected Gradient Method

The projected gradient descent algorithm is a first order optimization method for finding a feasible local optimal solution of constrained optimization problems. Motivated by the work of Bertsimas et al. [29], we develop a discrete extension of the projected gradient descent algorithm to solve the graph structured sparse subset selection problem. Specifically, we apply the idea of the accelerated proximal gradient method for nonconvex programming to achieve a better convergence rate and to provide an opportunity to escape from local optima [36, 37]. Algorithm 1 describes the proposed accelerated discrete projected gradient descent method. To simplify the notation, we denote by $S(G, t) = \{\beta : \sum_{(i,j)\in E} |I(\beta_i \neq 0) - I(\beta_j \neq 0)| \le t\}$ the set of vectors satisfying the graph constraint with a degree of violation $t$. Note that we use $g(\beta) = (2n)^{-1}\|y - X\beta\|_2^2$. However, $g$ can be any nonnegative convex function with a Lipschitz continuous gradient; possible choices of $g$ thus include the least absolute deviation loss or the logistic loss function. $L$ is a constant satisfying $L \ge \ell$, where $\ell$ is the Lipschitz constant of $\nabla g$.

Algorithm 1 Accelerated Discrete Projected Gradient Method for Grass
1: Input: data $X$ and $y$, graph $G$, and hyperparameters $k$ and $t$
2: Output: stationary point solution $\beta^*$
3: Initialize:
4:   $\beta^{(0)} \in \mathbb{R}^{p \times 1}$ such that $\|\beta^{(0)}\|_0 \le k$ and $\beta^{(0)} \in S(G, t)$
5:   $\beta^{(1)} \leftarrow \beta^{(0)}$
6:   $\gamma^{(1)} \leftarrow \beta^{(1)}$
7: for $m \ge 1$ do
8:   $\alpha^{(m)} = \gamma^{(m)} - \frac{1}{L}\nabla g(\gamma^{(m)})$
9:   $\beta^{(m)} = P_{(G,k,t)}(\alpha^{(m)})$
10:  if $\|\beta^{(m)} - \beta^{(m-1)}\|_2 \le \epsilon$ then break
11:  $\delta^{(m)} = \beta^{(m)} + \frac{m}{m+3}(\beta^{(m)} - \beta^{(m-1)})$
12:  if $g(\beta^{(m)}) \le g(\delta^{(m)})$ then
13:    $\gamma^{(m+1)} = \beta^{(m)}$
14:  else
15:    $\gamma^{(m+1)} = \delta^{(m)}$
16:  end if
17:  $m \leftarrow m + 1$
18: end for
19: $\beta^* \leftarrow \beta^{(m)}$

The algorithm starts with a feasible solution $\beta^{(0)}$ and then iterates the gradient descent, projection, and extrapolation steps until convergence is reached. In the gradient descent step, the algorithm tries to improve the current solution $\gamma^{(m)}$ by using the gradient $\nabla g$. Subsequently, it performs the projection step, which plays a critical role in the algorithm. $P_{(G,k,t)}(\cdot)$ is the graph structured projection operator: we write $\beta = P_{(G,k,t)}(\alpha)$ if $\beta$ is the nearest feasible point of the Grass problem from $\alpha$. Thus, the algorithm determines the projection of $\alpha^{(m)}$ onto the feasible set. Finally, in the extrapolation step, the momentum is computed to accelerate the convergence. Consequently, the algorithm is monotonically decreasing. A momentum-based update is

accepted only if the objective value decreases. Algorithm 1 iteratively improves the objective function by the gradient descent and extrapolation steps and recovers feasibility by applying the projection operator. The projection $P_{(G,k,t)}(\alpha)$ can be obtained by solving an optimization problem. The details of the graph structured projection operator are discussed in the next section.

3.4. Graph Structured Projection Operator

The graph structured projection $\beta = P_{(G,k,t)}(\alpha)$ of $\alpha$ is a Euclidean projection of $\alpha$ onto the feasible set of the Grass problem. To obtain $\beta$, the following MIQO problem should be solved:

$$
\begin{aligned}
\underset{\beta, z, e}{\text{minimize}}\quad & \sum_{i=1}^{p} (\beta_i - \alpha_i)^2 \\
\text{subject to}\quad & -M z_i \le \beta_i \le M z_i, \quad i = 1, \cdots, p, \\
& \textstyle\sum_{i=1}^{p} z_i \le k, \\
& -e_{ij} \le z_i - z_j \le e_{ij}, \quad (i,j) \in E, \\
& \textstyle\sum_{(i,j)\in E} e_{ij} \le t, \\
& \beta_i \in \mathbb{R}, \quad i = 1, \cdots, p, \\
& z_i \in \{0, 1\}, \quad i = 1, \cdots, p, \\
& e_{ij} \in \{0, 1\}, \quad (i,j) \in E.
\end{aligned}
\qquad (6)
$$
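For concreteness, the following Python sketch shows one way to implement the projection step (6) with an off-the-shelf MIO solver and to embed it in the outer loop of Algorithm 1. It is a minimal illustration under several assumptions: gurobipy is used as the solver interface (Gurobi being one of the solvers named in Section 3.2), $M$ is set to $\max_i |\alpha_i|$ (which the proof of Theorem 1 below shows is sufficient for the projection problem), and all function and variable names are ours, not the authors' code.

```python
import numpy as np
import gurobipy as gp
from gurobipy import GRB


def project_grass(alpha, edges, k, t):
    """Euclidean projection of alpha onto {beta : ||beta||_0 <= k, beta in S(G, t)}, i.e., problem (6)."""
    p = len(alpha)
    M = max(np.abs(alpha).max(), 1e-8)            # M = max_i |alpha_i| suffices for the projection
    m = gp.Model("grass_projection")
    m.Params.OutputFlag = 0
    beta = m.addVars(p, lb=-GRB.INFINITY, name="beta")
    z = m.addVars(p, vtype=GRB.BINARY, name="z")
    e = m.addVars(len(edges), vtype=GRB.BINARY, name="e")
    for i in range(p):                            # big-M link between beta_i and z_i
        m.addConstr(beta[i] <= M * z[i])
        m.addConstr(beta[i] >= -M * z[i])
    m.addConstr(gp.quicksum(z[i] for i in range(p)) <= k)                 # cardinality constraint
    for idx, (i, j) in enumerate(edges):          # e_ij = 1 whenever z_i != z_j
        m.addConstr(z[i] - z[j] <= e[idx])
        m.addConstr(z[j] - z[i] <= e[idx])
    m.addConstr(gp.quicksum(e[idx] for idx in range(len(edges))) <= t)    # at most t violated edges
    m.setObjective(gp.quicksum((beta[i] - alpha[i]) * (beta[i] - alpha[i]) for i in range(p)),
                   GRB.MINIMIZE)
    m.optimize()
    return np.array([beta[i].X for i in range(p)])


def grass_dpgd(X, y, edges, k, t, max_iter=500, eps=1e-6):
    """Accelerated discrete projected gradient descent (Algorithm 1) for the least squares loss."""
    n, p = X.shape
    L = 1.01 * np.linalg.eigvalsh(X.T @ X / n).max()     # L >= Lipschitz constant of grad g
    g = lambda b: np.sum((y - X @ b) ** 2) / (2 * n)
    grad = lambda b: X.T @ (X @ b - y) / n
    beta = np.zeros(p)                                   # feasible start: the all-zero vector
    gamma = beta.copy()
    for m in range(1, max_iter + 1):
        alpha = gamma - grad(gamma) / L                  # gradient descent step
        beta_new = project_grass(alpha, edges, k, t)     # projection step
        if np.linalg.norm(beta_new - beta) <= eps:
            return beta_new
        delta = beta_new + m / (m + 3) * (beta_new - beta)   # extrapolation (momentum) step
        gamma = beta_new if g(beta_new) <= g(delta) else delta
        beta = beta_new
    return beta
```

Theorem 1 in this subsection shows that the quadratic objective above can be replaced by the linear objective $\sum_i \alpha_i^2 z_i$ (with $\beta_i = \alpha_i z_i$), so the same projection can be posed as a mixed-integer linear program for better solver performance.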

We call this optimization problem the graph structured projection problem; it has the same constraints as the original Grass problem because it must find a feasible point of the Grass problem. Although the graph structured projection problem can be solved by optimization solvers, we introduce a more efficient formulation. The problem can be linearized so that the projection $P_{(G,k,t)}(\alpha)$ is obtained by solving a mixed-integer linear program instead of a mixed-integer quadratic program. We begin with an optimality condition for the problem.


Lemma 1. An optimal solution $\beta^*$ and $z^*$ of the graph structured projection problem satisfies the following equation:

$$ \beta_i^* = \alpha_i z_i^*, \quad i = 1, \cdots, p. \qquad (7) $$

Proof. Without loss of generality, we may assume $|\alpha_i| > 0$ for all $i = 1, \cdots, p$. Let $\beta^* = (\beta_1^*, \cdots, \beta_p^*)$ and $z^* = (z_1^*, \cdots, z_p^*)$ be an optimal solution; then $\beta_i^* = 0$ if $z_i^* = 0$. The objective value associated with $\beta^*$ is given by

$$ \sum_{\{i : z_i^* = 0\}} \alpha_i^2 + \sum_{\{i : z_i^* \neq 0\}} (\beta_i^* - \alpha_i)^2. \qquad (8) $$

Therefore, $\beta_i^* = \alpha_i$ for $i$ such that $z_i^* = 1$ is optimal, and thus we have

$$
\beta_i^* =
\begin{cases}
\alpha_i, & \text{if } z_i^* = 1, \\
0, & \text{if } z_i^* = 0.
\end{cases}
\qquad (9)
$$

The linearized graph structured projection problem is introduced as follows:

Theorem 1. The graph structured projection problem is equivalent to the following linearized graph structured projection problem:

$$
\begin{aligned}
\underset{z, e}{\text{maximize}}\quad & \sum_{i=1}^{p} \alpha_i^2 z_i \\
\text{subject to}\quad & \textstyle\sum_{i=1}^{p} z_i \le k, \\
& -e_{ij} \le z_i - z_j \le e_{ij}, \quad (i,j) \in E, \\
& \textstyle\sum_{(i,j)\in E} e_{ij} \le t, \\
& z_i \in \{0, 1\}, \quad i = 1, \cdots, p, \\
& e_{ij} \in \{0, 1\}, \quad (i,j) \in E,
\end{aligned}
\qquad (10)
$$

where $\beta$ is determined by Equation (7).

Proof. To show the equivalence of the two projection problems, we use the optimality condition established in Lemma 1. Substituting $\beta_i = \alpha_i z_i$ into the projection problem does not change the optimal solutions because this relation always holds at an optimal point. The first constraint of (6) can be rewritten as

$$ -M z_i \le \beta_i \le M z_i \;\Leftrightarrow\; -M z_i \le \alpha_i z_i \le M z_i, \qquad (11) $$

and can be eliminated by taking $M = \max_i |\alpha_i|$. The objective function can now be rewritten as follows:

$$
\begin{aligned}
\min_{z, e} \sum_{i=1}^{p} (\alpha_i z_i - \alpha_i)^2
&\;\Leftrightarrow\; \min_{z, e} \sum_{i=1}^{p} \alpha_i^2 (z_i - 1)^2 \qquad (12) \\
&\;\Leftrightarrow\; \min_{z, e} \sum_{i=1}^{p} \alpha_i^2 (z_i^2 - 2 z_i + 1) \qquad (13) \\
&\;\Leftrightarrow\; \min_{z, e} \sum_{i=1}^{p} \alpha_i^2 (-z_i) \qquad (14) \\
&\;\Leftrightarrow\; \max_{z, e} \sum_{i=1}^{p} \alpha_i^2 z_i. \qquad (15)
\end{aligned}
$$

From (13) to (14), we use $z_i^2 = z_i$ and drop the constant term $\sum_{i=1}^{p} \alpha_i^2$, which does not affect the minimizer. Eliminating the first constraint of (6) and transforming the objective function as above yield the linearized graph structured projection problem.

Because Algorithm 1 solves the graph structured projection problem in each iteration, it is important to solve the problem efficiently. The current state-of-the-art solvers are better equipped to handle mixed-integer linear optimization problems than mixed-integer quadratic optimization problems. Hence, our linearization approach significantly improves the computational efficiency.

3.5. Convergence Analysis

In this section, we examine some convergence properties of Algorithm 1. To prove convergence, we introduce an upper bound of the objective function and show that the objective value monotonically decreases under the updating rules of the algorithm. This idea has been widely used in previous studies in the machine learning field [29, 38, 39, 40, 41]. Consider the following optimization problem:

$$
\underset{\beta}{\text{minimize}}\;\; g(\beta) \quad \text{subject to} \quad \|\beta\|_0 \le k, \;\; \beta \in S(G, t),
\qquad (16)
$$

where $g$ is nonnegative, convex, and has a Lipschitz continuous gradient with Lipschitz constant $\ell$, so that

$$ \|\nabla g(\beta) - \nabla g(\alpha)\|_2 \le \ell \|\beta - \alpha\|_2 \qquad (17) $$

for any $\alpha$ and $\beta$. We first introduce an upper bound $Q_L$ of $g$.

Proposition 1 (Nesterov [42]). Given a convex function $g$ satisfying condition (17) and for any $L \ge \ell$, we have the following inequality:

$$ g(\alpha) \le Q_L(\alpha, \beta) = g(\beta) + \frac{L}{2}\|\alpha - \beta\|_2^2 + \nabla g(\beta)^{T}(\alpha - \beta). \qquad (18) $$
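As a quick numerical illustration of (18) for the least squares loss used in this paper, the snippet below takes $L$ equal to the largest eigenvalue of $X^{T}X/n$, which is the Lipschitz constant of $\nabla g$ for $g(\beta) = (2n)^{-1}\|y - X\beta\|_2^2$, and checks the bound at random points. The dimensions and random seed are arbitrary; this is a sanity check, not part of the proposed method.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 20
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

g = lambda b: np.sum((y - X @ b) ** 2) / (2 * n)
grad = lambda b: X.T @ (X @ b - y) / n
L = np.linalg.eigvalsh(X.T @ X / n).max()        # Lipschitz constant of grad g

for _ in range(5):
    a, b = rng.standard_normal(p), rng.standard_normal(p)
    Q = g(b) + 0.5 * L * np.sum((a - b) ** 2) + grad(b) @ (a - b)
    assert g(a) <= Q + 1e-9                      # the majorization (18) holds
print("Proposition 1 verified numerically for the least squares loss")
```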

Using the upper bound $Q_L$ of $g$ given in Proposition 1, we present a key lemma to demonstrate the convergence properties of Algorithm 1.

Lemma 2. Let $g$ be a nonnegative convex function satisfying condition (17). For any $L \ge \ell$ and $m \ge 1$, the following inequality holds:

$$ g(\gamma^{(m)}) - g(\beta^{(m)}) \ge \frac{L - \ell}{2} \left\| \beta^{(m)} - \gamma^{(m)} \right\|_2^2. \qquad (19) $$

Proof. Let $\beta$ be a vector such that $\|\beta\|_0 \le k$ and $\beta \in S(G, t)$, and let $\beta^* = P_{(G,k,t)}\!\left( \gamma - L^{-1}\nabla g(\gamma) \right)$. Using the upper bound $Q_L$ of $g$ introduced in Proposition 1, we have the following chain of inequalities:

$$
\begin{aligned}
g(\gamma) &= Q_L(\gamma, \gamma) \qquad (20) \\
&\ge \inf_{\|\beta\|_0 \le k,\; \beta \in S(G,t)} Q_L(\beta, \gamma) \qquad (21) \\
&= \inf_{\|\beta\|_0 \le k,\; \beta \in S(G,t)} \left( \frac{L}{2}\|\beta - \gamma\|_2^2 + \nabla g(\gamma)^{T}(\beta - \gamma) + g(\gamma) \right) \qquad (22) \\
&= \inf_{\|\beta\|_0 \le k,\; \beta \in S(G,t)} \left( \frac{L}{2}\left\| \beta - \left( \gamma - \frac{1}{L}\nabla g(\gamma) \right) \right\|_2^2 - \frac{1}{2L}\|\nabla g(\gamma)\|_2^2 + g(\gamma) \right) \qquad (23) \\
&= \frac{L}{2}\left\| \beta^* - \left( \gamma - \frac{1}{L}\nabla g(\gamma) \right) \right\|_2^2 - \frac{1}{2L}\|\nabla g(\gamma)\|_2^2 + g(\gamma) \qquad (24) \\
&= \frac{L}{2}\|\beta^* - \gamma\|_2^2 + \nabla g(\gamma)^{T}(\beta^* - \gamma) + g(\gamma) \qquad (25) \\
&= \frac{L - \ell}{2}\|\beta^* - \gamma\|_2^2 + \underbrace{\frac{\ell}{2}\|\beta^* - \gamma\|_2^2 + \nabla g(\gamma)^{T}(\beta^* - \gamma) + g(\gamma)}_{Q_\ell(\beta^*, \gamma)} \qquad (26) \\
&\ge \frac{L - \ell}{2}\|\beta^* - \gamma\|_2^2 + g(\beta^*). \qquad (27)
\end{aligned}
$$

Therefore, we have:

$$ g(\gamma) - g(\beta^*) \ge \frac{L - \ell}{2}\|\gamma - \beta^*\|_2^2. \qquad (28) $$

Applying (28) with $\gamma = \gamma^{(m)}$ and $\beta^* = P_{(G,k,t)}(\gamma^{(m)} - L^{-1}\nabla g(\gamma^{(m)})) = \beta^{(m)}$, we obtain (19). In addition, because $\|\gamma^{(m)} - \beta^{(m)}\|_2^2 \ge 0$, we have $g(\gamma^{(m)}) \ge g(\beta^{(m)})$.

Lemma 2 implies that the solution $\gamma^{(m)}$ is improved by applying the gradient descent and projection steps. The convergence properties are now as follows:

Theorem 2. If $L \ge \ell$, then the sequence $\{g(\beta^{(m)}) : m = 1, 2, \cdots\}$ generated by Algorithm 1 monotonically decreases and converges.

Proof. Let $\{\beta^{(m)} : m = 1, 2, \cdots\}$ and $\{\gamma^{(m)} : m = 1, 2, \cdots\}$ be the sequences generated by Algorithm 1. From Lemma 2, $g(\gamma^{(m+1)}) \ge g(\beta^{(m+1)})$, and by the update rule of Algorithm 1, $g(\beta^{(m)}) \ge g(\gamma^{(m+1)})$. Hence, we have

$$ g(\beta^{(m)}) \ge g(\gamma^{(m+1)}) \ge g(\beta^{(m+1)}), \quad \forall m \ge 1. \qquad (29) $$

Consequently, the sequence $\{g(\beta^{(m)}) : m = 1, 2, \cdots\}$ is monotonically decreasing. Moreover, the sequence is bounded below by the condition $g \ge 0$. Therefore, we can conclude that the sequence converges.

Theorem 3. If $L \ge \ell$, the sequence $\{\beta^{(m)} : m = 1, 2, \cdots\}$ generated by Algorithm 1 satisfies $\beta^{(m+1)} - \beta^{(m)} \to 0$ as $m \to \infty$.

Proof. Let $\{\beta^{(m)} : m = 1, 2, \cdots\}$ and $\{\gamma^{(m)} : m = 1, 2, \cdots\}$ be the sequences generated by Algorithm 1. Then:

$$
\begin{aligned}
\|\beta^{(m+1)} - \beta^{(m)}\|_2^2
&\le \|\beta^{(m+1)} - \gamma^{(m+1)}\|_2^2 \qquad (30) \\
&\le \frac{2}{L - \ell}\left( g(\gamma^{(m+1)}) - g(\beta^{(m+1)}) \right) \qquad (31) \\
&\le \frac{2}{L - \ell}\left( g(\beta^{(m)}) - g(\beta^{(m+1)}) \right). \qquad (32)
\end{aligned}
$$

By Theorem 2, the sequence $\{g(\beta^{(m)}) : m = 1, 2, \cdots\}$ monotonically decreases and converges, which implies $g(\beta^{(m)}) - g(\beta^{(m+1)}) \to 0$. Therefore, $\|\beta^{(m+1)} - \beta^{(m)}\|_2^2 \to 0$, and thus $\beta^{(m+1)} - \beta^{(m)} \to 0$.

3.6. Computational Complexity

In this section, we examine the computational complexity of Algorithm 1. Each iteration of the algorithm consists of three steps: the gradient descent, extrapolation, and projection steps. The gradient descent step for the least squares loss function requires $O(p^2)$ operations, and the extrapolation step requires $O(p)$. For the projection step, we use optimization solvers to find an optimal solution to the graph structured projection problem. The solvers use a branch-and-cut algorithm, which can solve general integer programming problems. However, the worst-case complexity of the branch-and-cut algorithm for integer linear programming with $m$ binary variables is $O(2^m)$, which is equivalent to exhaustive search. Therefore, the computational complexity of the projection step is $O(2^{p+q})$, where $p$ is the number of predictor variables and $q$ is the number of edges in the given graph. Because $O(2^{p+q})$ dominates $O(p^2)$ and $O(p)$, a single iteration of Algorithm 1 requires $O(2^{p+q})$. Therefore, assuming a fixed number of iterations, the computational complexity of the algorithm is $O(2^{p+q})$.

Although the theoretical computational complexity of Algorithm 1 is $O(2^{p+q})$, the algorithm is scalable to moderate-size problems. This is because, in practice, optimization solvers are far more efficient than exhaustive search for integer linear programming, although they are not theoretically guaranteed to find an optimal solution faster. We investigate the empirical computational complexity of Algorithm 1 experimentally in the next section.

4. Simulation Study

To evaluate the performance of the proposed graph structured sparse subset selection, we conducted a simulation study comparing the proposed Grass with nine regularized linear regression methods: Lasso [1], Enet [6], and subset selection based on the discrete first order algorithm (Subset) [29], which perform variable selection without incorporating graph structures of the predictor variables; and aGrace [22], GFLasso [23], Goscar, ncFGS [26], LTLP, and TTLP [28] among the graph regularization methods. Because of the high computational cost of LTLP and TTLP, these two methods were not applied to data sets with $p > 1000$.

4.1. Simulation Setup

Simulated data sets were generated from the following linear regression model:

$$ y = X\beta + \varepsilon, \qquad (33) $$

where $\varepsilon \sim N(0, \sigma^2)$ is random noise. For all cases, we set $\sigma^2 = \|\beta\|_2^2 / 4$. We considered two simple scenarios that have been used in graph regularization studies: gene regulatory networks and random networks. Each simulation data set consisted of a training set of size $n_{\text{train}}$, an independent validation set of size $n_{\text{valid}}$, and a testing set of size $n_{\text{test}}$. The training set was used to train the models, the validation set was used to determine the hyperparameters of the models, and the testing set was used to evaluate performance.

20

Figure 1: Graph structures used in the gene regulatory network scenarios. (a) RegN-S1–RegN-S3; (b) RegN-S4–RegN-S6; (c) RegN-L1–RegN-L3; (d) RegN-L4–RegN-L6.

• RegN-S1: The response vector $y$ was generated from the linear model (33) based on the following true regression coefficients:

$$ \beta = (5, \underbrace{\tfrac{5}{\sqrt{10}}, \cdots, \tfrac{5}{\sqrt{10}}}_{10}, -3, \underbrace{\tfrac{-3}{\sqrt{10}}, \cdots, \tfrac{-3}{\sqrt{10}}}_{10}, \underbrace{0, \cdots, 0}_{88})^{T}. $$

• RegN-S2: The setups were the same as that of RegN-S1, except that the true regression coefficients were given by:

$$ \beta = (5, \underbrace{\tfrac{-5}{\sqrt{10}}, \cdots, \tfrac{-5}{\sqrt{10}}}_{3}, \underbrace{\tfrac{5}{\sqrt{10}}, \cdots, \tfrac{5}{\sqrt{10}}}_{7}, -3, \underbrace{\tfrac{3}{\sqrt{10}}, \cdots, \tfrac{3}{\sqrt{10}}}_{3}, \underbrace{\tfrac{-3}{\sqrt{10}}, \cdots, \tfrac{-3}{\sqrt{10}}}_{7}, \underbrace{0, \cdots, 0}_{88})^{T}. $$

• RegN-S3: The setups were the same as that of RegN-S1, except that the true regression coefficients were given by:

$$ \beta = (5, \underbrace{0, \cdots, 0}_{3}, \underbrace{\tfrac{5}{\sqrt{10}}, \cdots, \tfrac{5}{\sqrt{10}}}_{7}, -3, \underbrace{0, \cdots, 0}_{3}, \underbrace{\tfrac{-3}{\sqrt{10}}, \cdots, \tfrac{-3}{\sqrt{10}}}_{7}, \underbrace{0, \cdots, 0}_{88})^{T}. $$

• RegN-S4, RegN-S5, and RegN-S6: The setups were the same as those of RegN-S1, RegN-S2, and RegN-S3, respectively, including the regression coefficients, except that ten edges were randomly removed from the network before ten edges were randomly added.

Similar to the small-size problems, we defined six scenarios for large-size problems. Each data set under these scenarios consisted of $n_{\text{train}} = 200$, $n_{\text{valid}} = 200$, and $n_{\text{test}} = 1000$. There were 200 subnetworks, each composed of 11 variables, and thus $p = 2200$. The graphs used for the large-size gene regulatory network scenarios are presented in Figure 1–(c) and Figure 1–(d).

• RegN-L1: The response vector $y$ was generated from the linear model (33) based on the following true regression coefficients:

$$ \beta = (5, \underbrace{\tfrac{5}{\sqrt{10}}, \cdots, \tfrac{5}{\sqrt{10}}}_{10}, -5, \underbrace{\tfrac{-5}{\sqrt{10}}, \cdots, \tfrac{-5}{\sqrt{10}}}_{10}, 3, \underbrace{\tfrac{3}{\sqrt{10}}, \cdots, \tfrac{3}{\sqrt{10}}}_{10}, -3, \underbrace{\tfrac{-3}{\sqrt{10}}, \cdots, \tfrac{-3}{\sqrt{10}}}_{10}, \underbrace{0, \cdots, 0}_{2156})^{T}. $$

• RegN-L2: The setups were the same as that of RegN-L1, except that the true regression coefficients were given by:

$$ \beta = (5, \underbrace{\tfrac{-5}{\sqrt{10}}, \cdots, \tfrac{-5}{\sqrt{10}}}_{3}, \underbrace{\tfrac{5}{\sqrt{10}}, \cdots, \tfrac{5}{\sqrt{10}}}_{7}, -5, \underbrace{\tfrac{5}{\sqrt{10}}, \cdots, \tfrac{5}{\sqrt{10}}}_{3}, \underbrace{\tfrac{-5}{\sqrt{10}}, \cdots, \tfrac{-5}{\sqrt{10}}}_{7}, 3, \underbrace{\tfrac{-3}{\sqrt{10}}, \cdots, \tfrac{-3}{\sqrt{10}}}_{3}, \underbrace{\tfrac{3}{\sqrt{10}}, \cdots, \tfrac{3}{\sqrt{10}}}_{7}, -3, \underbrace{\tfrac{3}{\sqrt{10}}, \cdots, \tfrac{3}{\sqrt{10}}}_{3}, \underbrace{\tfrac{-3}{\sqrt{10}}, \cdots, \tfrac{-3}{\sqrt{10}}}_{7}, \underbrace{0, \cdots, 0}_{2156})^{T}. $$

• RegN-L3: The setups were the same as that of RegN-L1, except that the true regression coefficients were given by:

$$ \beta = (5, \underbrace{0, \cdots, 0}_{3}, \underbrace{\tfrac{5}{\sqrt{10}}, \cdots, \tfrac{5}{\sqrt{10}}}_{7}, -5, \underbrace{0, \cdots, 0}_{3}, \underbrace{\tfrac{-5}{\sqrt{10}}, \cdots, \tfrac{-5}{\sqrt{10}}}_{7}, 3, \underbrace{0, \cdots, 0}_{3}, \underbrace{\tfrac{3}{\sqrt{10}}, \cdots, \tfrac{3}{\sqrt{10}}}_{7}, -3, \underbrace{0, \cdots, 0}_{3}, \underbrace{\tfrac{-3}{\sqrt{10}}, \cdots, \tfrac{-3}{\sqrt{10}}}_{7}, \underbrace{0, \cdots, 0}_{2156})^{T}. $$

• RegN-L4, RegN-L5, and RegN-L6: The setups were the same as those of RegN-L1, RegN-L2, and RegN-L3 including the regression coefficients, except that 100 edges were randomly removed from the network before 100 edges were randomly added.
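A minimal Python sketch of this edge perturbation (remove a fixed number of existing edges at random, then add the same number of edges between currently non-adjacent pairs) is given below. Whether a removed edge may be re-added is not specified in the paper; this sketch allows it, and the helper name is illustrative.

```python
import random

def perturb_edges(edges, p, n_perturb, seed=0):
    """Remove n_perturb random edges, then add n_perturb random edges between non-adjacent pairs."""
    rng = random.Random(seed)
    edge_set = {tuple(sorted(e)) for e in edges}
    for e in rng.sample(sorted(edge_set), n_perturb):    # random removals
        edge_set.remove(e)
    added = 0
    while added < n_perturb:                             # random additions between non-adjacent pairs
        i, j = rng.sample(range(p), 2)
        cand = (min(i, j), max(i, j))
        if cand not in edge_set:
            edge_set.add(cand)
            added += 1
    return sorted(edge_set)
```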


Figure 2: Graph structures used in the random network scenarios. (a) RandN-S1; (b) RandN-S2; (c) RandN-S3; (d) RandN-S4; (e) RandN-L1; (f) RandN-L2; (g) RandN-L3; (h) RandN-L4.

4.1.2. Random Networks

We simulated simple random networks following Tai et al. [43]. We considered small-size and large-size problems consisting of 100 and 2000 variables, respectively. For the small-size problems, we randomly divided the 100 variables into ten groups and generated a graph containing ten subgraphs. Each subgraph corresponded to one of the ten groups and was completely connected, while there were no edges between any two different subgraphs. We randomly removed

300 edges from the graph. Subsequently, we added edges to connect all the subgraphs. For each scenario, we selected one of the ten subgraphs to be informative and simulated the true regression coefficients of its variables from $N(0, 2^2)$. The remaining regression coefficients were set to 0. $X$ was simulated from an independent standard normal distribution, that is, $X_i \sim N(0, 1^2)$ for all $i$. For each data set of this scenario, we used $n_{\text{train}} = 50$, $n_{\text{valid}} = 50$, and $n_{\text{test}} = 200$. We simulated four different random networks, which are presented in Figure 2–(a) – Figure 2–(d).

For the large-size problems, we randomly divided 2000 variables into 200 groups and generated a graph containing 200 subgraphs. Each subgraph corresponded to one of the 200 groups and was completely connected, while there were no edges between any two different subgraphs. We randomly removed 3000 edges and subsequently added 500 edges to connect the subgraphs. For each scenario, we selected four of the 200 groups to be informative and simulated their regression coefficients from $N(0, 2^2)$. The remaining true coefficients were set to 0. $X$ was simulated from a standard normal distribution, that is, $X_i \sim N(0, 1^2)$ for all $i$. For each data set of this scenario, we used $n_{\text{train}} = 200$, $n_{\text{valid}} = 200$, and $n_{\text{test}} = 1000$. We simulated four different random networks, which are presented in Figure 2–(e) – Figure 2–(h).

4.2. Determining Hyperparameters

We performed a grid search to determine the hyperparameters of the regularized linear regression models, selecting the hyperparameter values that yielded the lowest prediction error on the validation set. For each method, we set the grid of hyperparameters as follows. For the Lasso, which has one hyperparameter $\lambda$, we computed $\lambda_{\max}$, the smallest $\lambda$ that yields $\hat{\beta} = 0$, as in Friedman et al. [44]. The hyperparameter $\lambda$ was searched over a set of 100 equally spaced grid points within the interval $[0.001\lambda_{\max}, \lambda_{\max}]$. For Enet, aGrace, GFLasso, Goscar, and ncFGS, which have two hyperparameters $\lambda_1$ and $\lambda_2$, we first calculated $\lambda_{\max}$ in the same way as for the Lasso. Then, we searched over hyperparameters $\lambda$ and $\alpha$ instead of $\lambda_1$ and $\lambda_2$, such that $\lambda_1 = \alpha\lambda$ and $\lambda_2 = (1 - \alpha)\lambda$. We used the intervals $[0.001\lambda_{\max}, \lambda_{\max}]$ for $\lambda$ and $[0.1, 1.0]$ for $\alpha$; for each hyperparameter, we tried ten equally spaced grid points within the interval. For LTLP and TTLP, which have three hyperparameters $\lambda_1$, $\lambda_2$, and $\tau$, we followed the method of Kim et al. [28] to determine the hyperparameters. For the proposed Grass, we searched $k$ over a set of ten equally spaced grid points within the interval $[0.1 p_{\max}, p_{\max}]$, where $p_{\max} = \min\{n, p\}$. Note

that $p_{\max}$ is the maximum number of nonzero variables that avoids an underdetermined system for linear regression. On the other hand, we searched $t$ over a set of ten equally spaced grid points within the interval $[0.01q, 0.1q]$, where $q$ is the number of edges of the given graph. This setting implies that we allow a maximum of 10% of the edges to be violated. Since our objective is to incorporate the graph structure, we do not use a large value for $t$.

4.3. Performance Comparison

We used three performance measures: the predictive root mean squared error (PRMSE), the coefficient root mean squared error (CRMSE), and the F1-score for variable selection. The PRMSE measures the predictive performance of a model on the testing set and is calculated by $\left( n^{-1}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 \right)^{1/2}$, where $y$ and $\hat{y}$ are the true and estimated response values, respectively. The CRMSE measures the coefficient estimation bias and is calculated by $\left( p^{-1}\sum_{i=1}^{p}(\beta_i - \hat{\beta}_i)^2 \right)^{1/2}$, where $\beta$ and $\hat{\beta}$ are the true and estimated regression coefficients, respectively. The F1-score evaluates the variable selection performance and is the harmonic mean of the precision and sensitivity. The precision measures the proportion of the selected variables that are actually nonzero, and the sensitivity measures the proportion of the actual nonzero variables that are correctly selected by a model. A model achieves the highest sensitivity when it selects all of the true nonzero variables, indicating its ability to detect truly significant variables. On the other hand, the highest precision is achieved when a model does not include any truly zero variables among those it selects. The F1-score considers both aspects: a model with a high F1-score has selected the significant variables accurately and sparsely.

We generated 50 replicates for each scenario and computed the three performance measures for the models. We calculated the averages and standard deviations of the measures over the repeated experiments and present them in Table 3 – Table 6; the values in parentheses are standard deviations. We also calculated the average of each performance measure and the average rank of each method, presented in the last two rows of each table. The ranks are computed within each scenario (row of the tables) and thus indicate how the methods compare with one another in each scenario. The ranks for PRMSE and CRMSE were assigned in ascending order (rank 1 for the smallest value), while the ranks for the F1-score were assigned in descending order. The numbers in boldface indicate the best

Scenario

Measure PRMSE

RegN-S1

CRMSE F1-score PRMSE

RegN-S2

CRMSE F1-score PRMSE

RegN-S3

CRMSE F1-score PRMSE

RegN-S4

CRMSE F1-score PRMSE

RegN-S5

CRMSE F1-score PRMSE

RegN-S6

CRMSE F1-score

PRMSE CRMSE F1-score PRMSE Average Rank CRMSE F1-score Average

Lasso 5.851 (0.526) 0.561 (0.085) 0.646 (0.091) 6.043 (0.433) 0.611 (0.064) 0.528 (0.059) 5.260 (0.486) 0.452 (0.066) 0.597 (0.101) 5.851 (0.526) 0.561 (0.085) 0.646 (0.091) 6.043 (0.433) 0.611 (0.064) 0.528 (0.059) 5.260 (0.486) 0.452 (0.066) 0.597 (0.101) 5.781 0.541 0.591 6.3 5.0 8.0

Enet 5.488 (0.476) 0.399 (0.060) 0.641 (0.148) 5.866 (0.465) 0.577 (0.039) 0.549 (0.079) 5.110 (0.451) 0.408 (0.058) 0.546 (0.132) 5.488 (0.476) 0.399 (0.060) 0.641 (0.148) 5.866 (0.465) 0.577 (0.039) 0.549 (0.079) 5.110 (0.451) 0.408 (0.058) 0.546 (0.132) 5.488 0.461 0.579 4.3 3.2 8.2

Subset 5.585 (0.661) 0.555 (0.135) 0.821 (0.091) 6.232 (0.311) 0.659 (0.094) 0.645 (0.064) 5.306 (0.706) 0.539 (0.103) 0.697 (0.090) 5.585 (0.661) 0.555 (0.135) 0.821 (0.091) 6.232 (0.311) 0.659 (0.094) 0.645 (0.064) 5.306 (0.706) 0.539 (0.103) 0.697 (0.090) 5.707 0.584 0.721 7.3 7.2 2.3

aGrace 4.852 (0.494) 0.175 (0.065) 0.767 (0.201) 5.459 (0.694) 0.364 (0.120) 0.643 (0.084) 4.923 (0.268) 0.330 (0.053) 0.660 (0.079) 5.058 (0.520) 0.232 (0.059) 0.769 (0.096) 5.764 (0.489) 0.486 (0.069) 0.621 (0.076) 4.972 (0.464) 0.356 (0.060) 0.646 (0.084) 5.171 0.494 0.685 2.8 1.0 3.7

GFLasso 4.630 (0.395) 0.373 (0.015) 0.619 (0.142) 5.748 (0.383) 0.616 (0.045) 0.599 (0.109) 4.693 (0.307) 0.470 (0.029) 0.540 (0.151) 4.855 (0.427) 0.430 (0.039) 0.495 (0.123) 5.869 (0.382) 0.628 (0.047) 0.646 (0.108) 4.915 (0.351) 0.479 (0.044) 0.599 (0.110) 5.118 0.513 0.627 1.8 4.3 5.8

Goscar 5.634 (0.320) 0.571 (0.060) 0.758 (0.157) 6.212 (0.496) 0.718 (0.047) 0.546 (0.089) 5.341 (0.470) 0.607 (0.038) 0.657 (0.130) 5.863 (0.547) 0.607 (0.050) 0.711 (0.116) 6.182 (0.506) 0.720 (0.050) 0.554 (0.092) 5.384 (0.521) 0.615 (0.048) 0.597 (0.116) 5.769 0.639 0.637 7.3 9.2 6.2

ncFGS 4.669 (0.427) 0.388 (0.048) 0.743 (0.086) 5.889 (0.534) 0.672 (0.070) 0.676 (0.147) 4.721 (0.276) 0.498 (0.024) 0.666 (0.101) 5.350 (0.662) 0.479 (0.081) 0.784 (0.100) 6.213 (0.476) 0.725 (0.058) 0.558 (0.085) 5.224 (0.482) 0.557 (0.058) 0.642 (0.113) 5.344 0.553 0.678 4.2 7.0 3.8

LTLP 6.400 (0.651) 0.686 (0.129) 0.612 (0.142) 6.284 (0.473) 0.626 (0.069) 0.531 (0.113) 5.665 (0.649) 0.518 (0.121) 0.523 (0.121) 6.420 (0.587) 0.650 (0.138) 0.601 (0.116) 6.305 (0.552) 0.621 (0.086) 0.532 (0.107) 5.676 (0.761) 0.528 (0.137) 0.487 (0.101) 6.125 0.605 0.548 10.0 7.2 9.7

TTLP 5.972 (0.818) 0.706 (0.216) 0.691 (0.227) 6.219 (0.487) 0.621 (0.070) 0.575 (0.184) 5.368 (0.576) 0.584 (0.145) 0.656 (0.193) 6.054 (0.668) 0.712 (0.229) 0.685 (0.231) 6.231 (0.596) 0.654 (0.086) 0.560 (0.194) 5.448 (0.705) 0.604 (0.143) 0.592 (0.186) 5.882 0.647 0.627 8.7 8.5 6.3

Grass 4.913 (0.495) 0.378 (0.084) 0.971 (0.071) 5.253 (0.643) 0.501 (0.055) 0.970 (0.070) 4.879 (0.441) 0.428 (0.058) 0.805 (0.064) 4.933 (0.346) 0.388 (0.056) 0.960 (0.021) 5.265 (0.362) 0.508 (0.057) 0.909 (0.076) 4.946 (0.444) 0.439 (0.068) 0.777 (0.066) 5.032 0.440 0.899 2.2 2.5 1.0

Table 3: Experimental results of small-size gene regulatory network scenarios

PRMSE, CRMSE, and F1-score values in each scenario.

Table 3 shows the results of the small-size gene regulatory network scenarios. The proposed Grass exhibited the best predictive performance in two cases, ranked second in terms of the average PRMSE rank among the compared algorithms, and scored the best average PRMSE. Furthermore, Grass had the best F1-score in all of the cases, the best average F1-score, and the best average rank in terms of F1-score. These results suggest that the proposed Grass clearly outperformed the other methods in terms of variable

Scenario

Measure PRMSE

RegN-L1

CRMSE F1-score PRMSE

RegN-L2

CRMSE F1-score PRMSE

RegN-L3

CRMSE F1-score PRMSE

RegN-L4

CRMSE F1-score PRMSE

RegN-L5

CRMSE F1-score PRMSE

RegN-L6

CRMSE F1-score

Average

Average Rank

PRMSE CRMSE F1-score PRMSE CRMSE F1-score

Lasso 7.568 (0.325) 0.182 (0.026) 0.553 (0.068) 8.073 (0.222) 0.156 (0.008) 0.458 (0.045) 6.717 (0.306) 0.125 (0.016) 0.509 (0.091) 7.568 (0.325) 0.182 (0.026) 0.553 (0.068) 8.073 (0.222) 0.156 (0.008) 0.458 (0.045) 6.717 (0.306) 0.125 (0.016) 0.509 (0.091) 7.453 0.154 0.507 6.8 5.5 6.7

Enet 7.398 (0.359) 0.090 (0.008) 0.567 (0.120) 8.030 (0.271) 0.156 (0.007) 0.480 (0.093) 6.715 (0.340) 0.085 (0.009) 0.470 (0.086) 7.398 (0.359) 0.090 (0.008) 0.567 (0.120) 8.030 (0.271) 0.156 (0.007) 0.480 (0.093) 6.715 (0.340) 0.085 (0.009) 0.470 (0.086) 7.381 0.110 0.506 5.5 2.3 6.7

Subset 6.697 (0.215) 0.135 (0.024) 0.895 (0.054) 7.569 (0.188) 0.174 (0.015) 0.642 (0.050) 6.177 (0.265) 0.116 (0.026) 0.768 (0.055) 6.697 (0.215) 0.135 (0.024) 0.895 (0.054) 7.569 (0.188) 0.174 (0.015) 0.642 (0.050) 6.177 (0.265) 0.116 (0.026) 0.768 (0.055) 6.814 0.142 0.768 2.7 5.0 2.3

aGrace 6.699 (0.347) 0.041 (0.008) 0.640 (0.087) 7.870 (0.277) 0.149 (0.005) 0.509 (0.086) 6.753 (0.415) 0.096 (0.009) 0.488 (0.098) 6.773 (0.347) 0.050 (0.006) 0.636 (0.086) 7.880 (0.260) 0.151 (0.004) 0.503 (0.076) 6.759 (0.411) 0.098 (0.008) 0.487 (0.096) 7.122 0.098 0.544 5.2 1.7 5.7

GFLasso 6.531 (0.243) 0.116 (0.002) 0.373 (0.055) 7.548 (0.315) 0.179 (0.006) 0.404 (0.086) 6.750 (0.291) 0.126 (0.007) 0.387 (0.063) 6.732 (0.269) 0.123 (0.003) 0.466 (0.050) 7.720 (0.248) 0. 183 (0.004) 0.506 (0.093) 6.823 (0.291) 0.124 (0.008) 0.556 (0.087) 7.017 0.142 0.448 4.2 5.3 7.0

Goscar 7.073 (0.351) 0.140 (0.005) 0.705 (0.068) 8.079 (0.333) 0.204 (0.010) 0.565 (0.099) 6.952 (0.337) 0.159 (0.010) 0.574 (0.078) 7.450 (0.376) 0.151 (0.008) 0.695 (0.101) 8.061 (0.267) 0.197 (0.006) 0.594 (0.084) 6.984 (0.365) 0.158 (0.011) 0.585 (0.081) 7.433 0.168 0.620 7.3 7.3 3.8

ncFGS 6.139 (0.146) 0.115 (0.001) 0.946 (0.096) 7.988 (0.349) 0.205 (0.013) 0.646 (0.124) 6.362 (0.170) 0.150 (0.006) 0.736 (0.113) 6.633 (0.247) 0.124 (0.003) 0.770 (0.094) 7.996 (0.338) 0.201 (0.014) 0.581 (0.123) 6.644 (0.311) 0.148 (0.010) 0.675 (0.088) 6.961 0.157 0.726 3.2 6.5 2.8

Grass 6.315 (0.179) 0.093 (0.008) 0.997 (0.005) 6.833 (0.782) 0.132 (0.012) 0.975 (0.058) 6.109 (0.268) 0.102 (0.016) 0.847 (0.039) 6.372 (0.213) 0.094 (0.007) 0.983 (0.014) 6.836 (0.560) 0.136 (0.015) 0.960 (0.041) 6.109 (0.227) 0.110 (0.017) 0.827 (0.047) 6.429 0.111 0.931 1.2 2.3 1.0

Table 4: Experimental results of large-size gene regulatory network scenarios

selection performance. On the other hand, Grass ranked second in terms of the CRMSE, implying that the method yielded a small coefficient estimation bias.

The results of the large-size gene regulatory network scenarios are presented in Table 4. Under this simulation setting, Grass showed the best predictive performance in five cases and ranked first in terms of the PRMSE. Moreover, Grass outperformed the other methods with regard to the variable selection performance in all cases. In addition, Grass ranked second


(Values per scenario are PRMSE / CRMSE / F1-score, reported as mean (standard deviation), followed by the averages and the average ranks over the four scenarios.)

Lasso: RandN-S1 7.625 (0.711) / 0.562 (0.077) / 0.443 (0.077); RandN-S2 3.352 (0.333) / 0.232 (0.045) / 0.404 (0.080); RandN-S3 6.328 (0.504) / 0.457 (0.058) / 0.454 (0.081); RandN-S4 4.233 (0.487) / 0.271 (0.060) / 0.389 (0.089); Average 5.385 / 0.380 / 0.423; Average rank 5.0 / 5.3 / 7.3

Enet: RandN-S1 7.976 (0.653) / 0.609 (0.071) / 0.382 (0.087); RandN-S2 3.467 (0.319) / 0.251 (0.044) / 0.349 (0.067); RandN-S3 6.549 (0.471) / 0.485 (0.047) / 0.421 (0.069); RandN-S4 4.369 (0.476) / 0.294 (0.057) / 0.354 (0.082); Average 5.592 / 0.410 / 0.377; Average rank 7.5 / 8.0 / 9.0

Subset: RandN-S1 8.018 (0.885) / 0.596 (0.125) / 0.570 (0.098); RandN-S2 3.587 (0.443) / 0.256 (0.056) / 0.463 (0.134); RandN-S3 6.618 (0.587) / 0.490 (0.072) / 0.487 (0.093); RandN-S4 4.479 (0.592) / 0.292 (0.076) / 0.425 (0.088); Average 5.675 / 0.408 / 0.487; Average rank 8.8 / 8.0 / 4.8

aGrace: RandN-S1 9.491 (0.828) / 0.776 (0.103) / 0.395 (0.132); RandN-S2 4.012 (0.327) / 0.321 (0.036) / 0.342 (0.076); RandN-S3 8.196 (0.532) / 0.682 (0.048) / 0.413 (0.092); RandN-S4 5.899 (0.603) / 0.480 (0.058) / 0.332 (0.069); Average 6.900 / 0.565 / 0.371; Average rank 10.0 / 10.0 / 9.8

GFLasso: RandN-S1 7.464 (0.780) / 0.527 (0.099) / 0.457 (0.131); RandN-S2 3.420 (0.291) / 0.239 (0.041) / 0.343 (0.090); RandN-S3 6.333 (0.409) / 0.449 (0.043) / 0.454 (0.075); RandN-S4 4.315 (0.408) / 0.279 (0.052) / 0.396 (0.068); Average 5.454 / 0.384 / 0.276; Average rank 5.5 / 5.0 / 7.3

Goscar: RandN-S1 7.466 (0.688) / 0.516 (0.085) / 0.504 (0.137); RandN-S2 3.526 (0.325) / 0.263 (0.042) / 0.362 (0.087); RandN-S3 6.441 (0.460) / 0.467 (0.044) / 0.472 (0.090); RandN-S4 4.497 (0.452) / 0.305 (0.057) / 0.385 (0.074); Average 5.383 / 0.374 / 0.412; Average rank 7.5 / 7.5 / 6.8

ncFGS: RandN-S1 6.655 (0.499) / 0.395 (0.071) / 0.677 (0.138); RandN-S2 3.408 (0.375) / 0.249 (0.049) / 0.412 (0.115); RandN-S3 6.049 (0.517) / 0.411 (0.061) / 0.576 (0.137); RandN-S4 4.345 (0.481) / 0.283 (0.060) / 0.439 (0.107); Average 5.114 / 0.334 / 0.526; Average rank 3.8 / 4.0 / 3.5

LTLP: RandN-S1 7.222 (0.663) / 0.481 (0.088) / 0.673 (0.177); RandN-S2 3.180 (0.340) / 0.217 (0.045) / 0.555 (0.091); RandN-S3 6.084 (0.444) / 0.418 (0.071) / 0.533 (0.160); RandN-S4 4.010 (0.407) / 0.232 (0.067) / 0.488 (0.087); Average 5.124 / 0.337 / 0.562; Average rank 3.0 / 3.3 / 3.3

TTLP: RandN-S1 7.197 (0.843) / 0.466 (0.101) / 0.769 (0.099); RandN-S2 3.257 (0.400) / 0.213 (0.055) / 0.524 (0.133); RandN-S3 6.269 (0.562) / 0.456 (0.088) / 0.533 (0.170); RandN-S4 3.994 (0.492) / 0.221 (0.076) / 0.491 (0.068); Average 5.179 / 0.339 / 0.579; Average rank 3.0 / 3.0 / 2.5

Grass: RandN-S1 6.244 (0.483) / 0.316 (0.074) / 0.965 (0.029); RandN-S2 2.890 (0.250) / 0.173 (0.029) / 0.946 (0.045); RandN-S3 5.109 (0.343) / 0.266 (0.053) / 0.948 (0.026); RandN-S4 3.801 (0.300) / 0.197 (0.041) / 0.936 (0.041); Average 4.511 / 0.238 / 0.949; Average rank 1.0 / 1.0 / 1.0

Table 5: Experimental results of small-size random network scenarios

We summarized the results of the small-size and large-size random network scenarios in Table 5 and Table 6, respectively. The PRMSE, CRMSE, and F1-score of the proposed Grass were all better than those of the other methods. These results imply that the proposed method correctly identified the truly significant variables while accurately estimating the regression coefficients, and thus yielded accurate predictions. These experimental results have three implications. First, our method clearly outperformed the subset selection method (Subset) with respect to the predictive, coefficient estimation, and variable selection performances. This result implies that our treatment of the graph structure of the predictor variables is valid; that is, better predictive and variable selection performances can be achieved when a proper graph of the predictor variables is available. Second, our method outperformed the existing graph regularization methods in terms of the variable selection performance while exhibiting comparable predictive and coefficient estimation performances.

(Values per scenario are PRMSE / CRMSE / F1-score, reported as mean (standard deviation), followed by the averages and the average ranks over the four scenarios.)

Lasso: RandN-L1 22.045 (1.245) / 0.429 (0.030) / 0.274 (0.068); RandN-L2 9.992 (0.820) / 0.187 (0.021) / 0.259 (0.037); RandN-L3 13.192 (0.732) / 0.229 (0.021) / 0.255 (0.027); RandN-L4 7.963 (0.315) / 0.119 (0.011) / 0.259 (0.037); Average 13.298 / 0.241 / 0.262; Average rank 6.0 / 6.0 / 6.0

Enet: RandN-L1 22.056 (0.926) / 0.431 (0.021) / 0.231 (0.039); RandN-L2 10.226 (0.795) / 0.193 (0.019) / 0.219 (0.041); RandN-L3 13.874 (0.711) / 0.249 (0.020) / 0.211 (0.035); RandN-L4 8.568 (0.297) / 0.139 (0.010) / 0.195 (0.025); Average 13.681 / 0.253 / 0.214; Average rank 7.0 / 7.0 / 7.0

Subset: RandN-L1 23.535 (1.251) / 0.464 (0.027) / 0.298 (0.070); RandN-L2 10.257 (0.728) / 0.192 (0.020) / 0.301 (0.061); RandN-L3 13.309 (0.877) / 0.236 (0.023) / 0.289 (0.043); RandN-L4 7.613 (0.660) / 0.107 (0.025) / 0.370 (0.058); Average 13.678 / 0.250 / 0.315; Average rank 3.0 / 3.0 / 3.0

aGrace: RandN-L1 24.089 (0.554) / 0.481 (0.007) / 0.112 (0.046); RandN-L2 11.915 (0.253) / 0.236 (0.005) / 0.127 (0.071); RandN-L3 17.432 (0.315) / 0.343 (0.006) / 0.127 (0.079); RandN-L4 11.573 (0.370) / 0.222 (0.008) / 0.185 (0.064); Average 16.252 / 0.320 / 0.138; Average rank 8.0 / 8.0 / 8.0

GFLasso: RandN-L1 22.087 (1.118) / 0.428 (0.030) / 0.251 (0.078); RandN-L2 10.039 (0.859) / 0.187 (0.020) / 0.266 (0.048); RandN-L3 13.209 (0.751) / 0.230 (0.020) / 0.259 (0.053); RandN-L4 7.770 (0.294) / 0.112 (0.010) / 0.277 (0.064); Average 13.276 / 0.239 / 0.263; Average rank 4.0 / 4.0 / 5.0

Goscar: RandN-L1 23.116 (1.236) / 0.454 (0.030) / 0.265 (0.081); RandN-L2 10.616 (0.825) / 0.201 (0.019) / 0.254 (0.047); RandN-L3 13.580 (0.767) / 0.240 (0.020) / 0.262 (0.043); RandN-L4 7.810 (0.248) / 0.114 (0.009) / 0.305 (0.060); Average 13.780 / 0.252 / 0.271; Average rank 5.0 / 5.0 / 4.0

ncFGS: RandN-L1 21.816 (1.887) / 0.419 (0.047) / 0.275 (0.063); RandN-L2 10.002 (1.099) / 0.183 (0.027) / 0.279 (0.077); RandN-L3 13.054 (0.977) / 0.226 (0.025) / 0.305 (0.058); RandN-L4 7.502 (0.208) / 0.104 (0.009) / 0.424 (0.037); Average 13.093 / 0.233 / 0.321; Average rank 2.0 / 2.0 / 2.0

Grass: RandN-L1 14.408 (1.783) / 0.192 (0.061) / 0.944 (0.090); RandN-L2 6.651 (0.121) / 0.079 (0.005) / 0.996 (0.008); RandN-L3 10.237 (1.308) / 0.133 (0.046) / 0.958 (0.070); RandN-L4 6.606 (0.151) / 0.066 (0.006) / 0.982 (0.009); Average 9.475 / 0.117 / 0.970; Average rank 1.0 / 1.0 / 1.0

Table 6: Experimental results of large-size random network scenarios

The main difference between our method and the other graph regularization methods is that our method does not rely on shrinkage to select variables or to incorporate the graph information. Instead, our method uses a discrete optimization formulation, which resulted in a better variable selection performance. It is also notable that the proposed method was better than LTLP and TTLP, whose constraints closely resemble those of the proposed method. Third, our method was more robust than the other graph regularization methods when the graphs were partially incorrect. For RegN-S1 – RegN-S3, the graphs were correctly provided. In contrast, for RegN-S4 – RegN-S6, some of the edges were randomly removed, and new edges were then randomly added to the graph.
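The graph perturbation used for RegN-S4 – RegN-S6 can be illustrated with the following minimal sketch. It is an assumption-laden illustration of "randomly remove some edges, then randomly add new ones"; the perturbed fraction below is a placeholder, not the value used in the experiments.

```python
import random

def perturb_graph(edges, p, frac=0.2, seed=0):
    """Randomly remove a fraction of the edges and add the same number of new random edges.

    edges: list of (i, j) pairs over vertices 0, ..., p-1; frac is a placeholder fraction.
    """
    rng = random.Random(seed)
    n_perturb = int(frac * len(edges))
    kept = rng.sample(list(edges), len(edges) - n_perturb)   # randomly removed edges
    existing = {frozenset(e) for e in kept}
    added = []
    while len(added) < n_perturb:                            # randomly added edges
        i, j = rng.randrange(p), rng.randrange(p)
        if i != j and frozenset((i, j)) not in existing:
            existing.add(frozenset((i, j)))
            added.append((i, j))
    return kept + added
```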

As the graphs became incorrect, the performance of the graph regularization methods degraded. However, the performance of the proposed method did not change significantly as the graphs became incorrect. Similar results were observed under the large-size gene regulatory network scenarios and the random network scenarios. We believe this is a remarkable advantage because graphs are not always correctly provided in real-world cases.

4.4. Hyperparameter Sensitivity Analysis

We analyzed the performance of the proposed method with different hyperparameters. We used the RegN-L6 and RandN-L1 scenarios to investigate how the performance changes as the hyperparameters k and t vary. We computed the training and testing RMSE, coefficient RMSE, and variable selection F1-score for each (k, t) pair; a sketch of this grid evaluation is given below. The results are presented in Figure 3. Each line in the plots shows how a performance measure changes as k varies for a fixed t. First, we can see that the training RMSE decreases as k increases (Figure 3–(a) and Figure 3–(b)). This is because increasing k allows the model to select more variables, and thus the model can fit the training data more easily. On the other hand, the testing RMSE decreases as k increases until k reaches a certain value, after which it increases (Figure 3–(c) and Figure 3–(d)). That is, overfitting occurs when k is greater than the number of truly nonzero variables. It is also notable that a smaller t yielded a relatively higher training RMSE but a lower testing RMSE; that is, better generalization was achieved when the model considered more edges. The coefficient RMSEs showed a pattern similar to that of the testing RMSE (Figure 3–(e) and Figure 3–(f)). In both scenarios, the variable selection performance changed as k varied. Similar to the testing RMSE, the F1-score increases until k reaches a certain value and then decreases. This is because a model with a small k cannot identify all the truly nonzero variables, whereas a large k leads the model to select redundant variables. On the other hand, t played an important role only in the RandN-L1 scenario. In RandN-L1, a smaller t resulted in a higher F1-score, whereas in RegN-L6 there was no significant difference in the F1-score for different values of t. This is because regulatory networks are far sparser than random networks.
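The grid evaluation described above can be summarized by the following minimal sketch. It is illustrative only: the `fit` argument stands in for the proposed Grass solver (not shown here), and all data arguments are assumed to be prepared elsewhere.

```python
import numpy as np

def rmse(pred, target):
    return float(np.sqrt(np.mean((pred - target) ** 2)))

def support_f1(beta_hat, beta_true):
    # F1-score of the selected support against the true support.
    sel, true = beta_hat != 0, beta_true != 0
    tp = np.sum(sel & true)
    if tp == 0:
        return 0.0
    precision, recall = tp / sel.sum(), tp / true.sum()
    return float(2 * precision * recall / (precision + recall))

def evaluate_grid(fit, X_tr, y_tr, X_te, y_te, beta_true, edges, k_grid, t_grid):
    """Evaluate every (k, t) pair; fit(X, y, edges, k, t) must return a coefficient vector."""
    results = {}
    for k in k_grid:        # cardinality constraint level
        for t in t_grid:    # graph structured subset constraint level
            beta_hat = fit(X_tr, y_tr, edges, k, t)
            results[(k, t)] = {
                "train_rmse": rmse(X_tr @ beta_hat, y_tr),
                "test_rmse": rmse(X_te @ beta_hat, y_te),
                "coef_rmse": rmse(beta_hat, beta_true),
                "f1": support_f1(beta_hat, beta_true),
            }
    return results
```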


Figure 3: Results of hyperparameter sensitivity analysis on RegN-L6 and RandN-L1. Panels: (a) RegN-L6 training RMSE, (b) RandN-L1 training RMSE, (c) RegN-L6 testing RMSE, (d) RandN-L1 testing RMSE, (e) RegN-L6 coefficient RMSE, (f) RandN-L1 coefficient RMSE, (g) RegN-L6 F1-score, (h) RandN-L1 F1-score.


Figure 4: Experimental results on computation time. Panels: (a) increasing number of variables (p), (b) increasing number of edges (q), (c) increasing number of observations (n).

4.5. Experiments on Computational Complexity

We presented the theoretical computational complexity of the proposed method in Section 3.6. In this section, we examine the empirical computational complexity. We first present the computation time of the proposed method on the gene regulatory and random network scenarios. Table 7 shows the average computation times measured in seconds and their standard deviations (values in parentheses). For small-size problems with about 100 dimensions, the proposed method terminated within 1 second, and for large-size problems with about 2000 dimensions, the computation times were about 10 seconds. These results imply that even though the theoretical complexity of the proposed algorithm is O(2^(p+q)), it is empirically feasible for data sets with thousands of predictor variables. Furthermore, we tested the computation time by varying the number of variables (p), edges (q), and observations (n).

Scenario    Time (s)         Scenario    Time (s)
RegN-S1     0.268 (0.061)    RegN-L1      5.535 (1.571)
RegN-S2     0.376 (0.070)    RegN-L2      8.089 (4.767)
RegN-S3     0.759 (0.422)    RegN-L3      8.426 (4.774)
RegN-S4     0.391 (0.093)    RegN-L4      8.178 (4.782)
RegN-S5     0.570 (0.099)    RegN-L5      8.809 (3.650)
RegN-S6     0.462 (0.086)    RegN-L6      8.727 (4.739)
RandN-S1    0.820 (0.244)    RandN-L1     5.741 (3.494)
RandN-S2    0.875 (0.200)    RandN-L2    11.264 (3.069)
RandN-S3    0.516 (0.130)    RandN-L3     6.629 (5.284)
RandN-S4    0.467 (0.105)    RandN-L4    10.307 (5.418)

Table 7: Computation time of the proposed method on the simulation data sets
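The empirical growth rates quoted in the next paragraphs (for example, roughly O(p^0.8)) can be estimated from such timings by a least-squares fit on the log-log scale. The following is a minimal sketch under the assumption that the timings have already been collected; the example numbers are made up for illustration only.

```python
import numpy as np

def empirical_exponent(sizes, times):
    """Estimate b in time ~ c * size**b by a least-squares fit of log(time) on log(size)."""
    slope, _intercept = np.polyfit(np.log(np.asarray(sizes, dtype=float)),
                                   np.log(np.asarray(times, dtype=float)), 1)
    return float(slope)

# Illustrative (made-up) timings giving an exponent of about 0.8:
# empirical_exponent([200, 500, 1000, 2000], [1.0, 2.1, 3.6, 6.3])
```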

We simulated synthetic data sets similar to those of Yang et al. [26] so that y = Xβ + ε, X_i ∼ N(0, 1²), β_i ∼ N(0, 1²) for i = 1, ..., p, and ε ∼ N(0, 0.01²). The graph G is randomly generated; a minimal sketch of this data generation is given after this paragraph. The results are summarized in Figure 4. First, we checked the computation time by varying the number of variables p from 200 to 5000, fixing the number of edges at q = 2000 and the number of observations at n = 100. For each combination of p, q, and n, we tried the hyperparameters k ∈ {0.25p_max, 0.5p_max, 0.75p_max, 1.0p_max} and t ∈ {0.25q, 0.5q, 0.75q, 1.0q}, where p_max = min{n, p}. For each setting, we repeated the experiment 10 times and reported the average computation time. The results are given in Figure 4–(a). Each line represents a different hyperparameter setting, and the values in parentheses are the multipliers used to determine k and t; for example, (0.25, 0.25) means the results were produced with k = 0.25p_max and t = 0.25q. The computation time increases approximately as O(p^0.8), except that the computation time of (0.25, 0.25) and (0.25, 0.5) increased more rapidly when p > 3000. We also tested the computation time by increasing the number of edges q from 200 to 5000, fixing the number of variables at p = 2000 and the number of observations at n = 100. The hyperparameters were set the same as in the previous experiment. Figure 4–(b) shows the results. As in the previous experiment, the computation time increases as the number of edges increases. However, the pattern for q ≤ 1000 differs from that for q > 1000: the computation time slightly decreases at q = 1200. The computation time increases approximately as O(q^0.5) when q ≤ 1000 and as O(q^0.7) otherwise. These results imply that some problems with a relatively small number of edges are more difficult than some problems with more edges; in general, however, more edges lead to more computation time. Finally, we investigated the computation time by varying the number of observations n from 200 to 5000, fixing the number of variables at p = 1000 and the number of edges at q = 2000. We tried the hyperparameters k ∈ {25, 50, 75, 100} and t ∈ {0.25q, 0.5q, 0.75q, 1.0q}.
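A minimal sketch of this synthetic data generation follows. The random graph construction below is a simple stand-in, since the exact generator of G is not specified here.

```python
import numpy as np

def simulate(n, p, q, seed=0):
    """Generate y = X beta + eps with X_i ~ N(0, 1), beta_i ~ N(0, 1),
    eps ~ N(0, 0.01**2), and a random graph G with q undirected edges."""
    rng = np.random.default_rng(seed)
    X = rng.normal(0.0, 1.0, size=(n, p))
    beta = rng.normal(0.0, 1.0, size=p)
    y = X @ beta + rng.normal(0.0, 0.01, size=n)

    edges = set()
    while len(edges) < q:                 # simple uniform-random edge sampling
        i, j = rng.integers(0, p, size=2)
        if i != j:
            edges.add((int(min(i, j)), int(max(i, j))))
    return X, y, beta, sorted(edges)
```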

Method     PRMSE    # of Selected Variables    # of Associated Edges
Lasso      1.118     59        14
Enet       1.089    145        64
Subset     1.065     15         0
aGrace     1.081    706      1239
GFLasso    1.089   1709      6850
Goscar     1.155     68        18
ncFGS      1.236     31         7
Grass      0.936    105       166

Table 8: Experimental results on glioblastoma data set

Unlike the two previous experiments, the computation time decreases as n increases. This is because our original formulation becomes less difficult as more observations are provided. Therefore, the graph structured projection problem also becomes easier, and the actual computation time decreases. The computation time drops sharply for n ≤ 1000, approximately as O(n^-1); this is because, for n ≤ 1000, the problem has n < p. On the other hand, for n > 1000, we observe approximately O(n^-0.3).

5. Case Study

To evaluate the real-world performance of the proposed method, we conducted experiments on two microarray gene expression studies. As in the simulation study, LTLP and TTLP were not applied because of their high computational cost.

5.1. Glioblastoma Data Set

We applied the methods to a microarray gene expression data set of glioblastoma patients from Horvath et al. [45]. The data set consisted of two independent sets, with sample sizes n = 55 and 65, and included clinical and gene expression data obtained with high-density Affymetrix HGU133A arrays containing 8000 probe ids. In our study, we built a predictive model with the gene expressions as the predictor variables and the log of the time to death as the response variable. For the graph regularization methods, we used Kyoto Encyclopedia of Genes and Genomes (KEGG) regulatory pathways defined on the HGU133A chip. Using the R Bioconductor library HGU133A [46], we identified 224 pathways defined on HGU133A. The pathway information was subsequently merged with the gene expression data, and the genes that were not connected to any others were removed; a sketch of this step is given below. The resulting graph includes 1953 vertices with 9593 edges in total. The vertex degrees ranged from 1 to 102, with a mean of 9.8 and a median of 6. We used the 1953 gene expressions as the predictor variables to predict the log survival times. To build predictive models, we merged the two sets of patients and removed nine patients who were alive at the last follow-up.
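A minimal sketch of this merging and filtering step, under the assumption that the KEGG pathway relations have been exported from the annotation package to an edge list of probe-id pairs; the file and column names below are placeholders, not part of the original study.

```python
import pandas as pd

# Placeholder file names: an expression matrix (samples x probe ids) and an
# edge list of probe-id pairs exported from the pathway annotation.
expr = pd.read_csv("expression_hgu133a.csv", index_col=0)
edges = pd.read_csv("kegg_edges_hgu133a.csv")  # columns: probe_a, probe_b

# Keep only edges whose endpoints are both measured on the chip.
measured = set(expr.columns)
edges = edges[edges["probe_a"].isin(measured) & edges["probe_b"].isin(measured)]

# Remove genes that are not connected to any other gene.
connected = sorted(set(edges["probe_a"]).union(edges["probe_b"]))
expr = expr[connected]
```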

Figure 5: Subnetworks identified by Grass from glioblastoma data set

We then randomly split the set into training and testing sets of sizes n = 88 and 23, respectively. Five-fold cross-validation was conducted on the training set to determine the hyperparameters of the models, and the final models were then trained on the entire training set. Table 8 shows the results in terms of the predictive RMSE measured on the independent testing set, the number of selected variables, and the number of edges associated with the selected variables. We first confirmed that some of the graph regularization methods exhibited smaller prediction errors than the non-graph methods; therefore, we conclude that utilizing the graph information resulted in better predictive models. The proposed method yielded a slightly smaller prediction error than the other methods, which may indirectly suggest that it selected more significant variables. Moreover, our method identified relatively smaller subnetworks than aGrace and GFLasso. This result implies that the proposed Grass identified small but more significant subnetworks related to survival time. On the other hand, Goscar and ncFGS did not identify significant subnetworks.
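For reference, the quantities reported in Table 8 can be computed as in the following sketch, where an "associated edge" is taken to be an edge whose two endpoints are both selected (our reading of the table, stated here as an assumption).

```python
import numpy as np

def report_metrics(beta_hat, edges, X_test, y_test):
    """Predictive RMSE, number of selected variables, and number of edges
    associated with (i.e., having both endpoints among) the selected variables."""
    selected = set(np.flatnonzero(beta_hat).tolist())
    n_assoc_edges = sum(1 for i, j in edges if i in selected and j in selected)
    prmse = float(np.sqrt(np.mean((X_test @ beta_hat - y_test) ** 2)))
    return prmse, len(selected), n_assoc_edges
```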


Figure 6: Subnetworks identified by Grass from rat eye data set

Figure 5 shows the subnetworks identified by the proposed Grass. The largest subnetwork includes genes involved in oxidative phosphorylation and its connected pathways, including the metabolic pathway. Moreover, in this subnetwork, the genes shown on the right side of PPA1, PPA2, and LHPP are associated with the phagosome, collecting duct acid secretion, Vibrio cholerae infection, and epithelial cell signaling in Helicobacter pylori infection, whereas the genes on the left side are associated with Alzheimer's, Parkinson's, and Huntington's diseases. Likewise, there is a subnetwork that includes the genes CHMP2A, CHMP2B, CHMP1A, CHMP1B, CHMP5, VPS4A, and VPS4B involved in endocytosis, a subnetwork that includes CNTNAP1, NFASC, CNTN1, and NRCAM involved in cell adhesion molecules, and a subnetwork that includes LIPE, PRKAR1A, PRKAR1B, PRKAR2A, and PRKAR2B involved in both apoptosis and the insulin signaling pathway. These results suggest that our method can identify subnetworks that are potentially relevant to the survival time of glioblastoma patients.


Method     PRMSE    # of Selected Variables    # of Associated Edges
Lasso      0.146     29        1
Enet       0.148     30        0
Subset     0.147     16        0
aGrace     0.159   1437     5922
GFLasso    0.156    355      475
Goscar     0.156     25        0
ncFGS      0.163     18        1
Grass      0.144    177      117

Table 9: Experimental results on rat eye data set

5.2. Rat Eye Data Set

We conducted the second experiment on a microarray gene expression data set from Scheetz et al. [47]. In that study, F1 animals were intercrossed, and 12-week-old male offspring were selected for tissue harvesting from the eyes, followed by microarray analysis using Affymetrix Rat Genome 230 2.0 (RAT2302) arrays. One of the main interests in this data set is finding genes correlated with the gene TRIM32, which was reported to cause Bardet-Biedl syndrome [48], a genetically heterogeneous disease of multiple organ systems, including the retina. One approach to finding the genes related to TRIM32 is regression analysis [12]. The data set contains gene expression data of 120 samples obtained with high-density Affymetrix Rat Genome 230 2.0 arrays with 31100 probe ids. In our study, we constructed a predictive model with the gene expression level of TRIM32 as the response variable and the remaining gene expressions as predictor variables. Because this kind of study is mainly interested in genes whose expression values are relatively more variable across samples, we computed the variances of the expression levels and selected the top 2000 most variable genes. Using the R Bioconductor library rat2302 [49], we then identified 222 pathways defined on RAT2302 and merged the pathway information with the gene expression data, after which we removed the genes that had no connection to the others. The resulting graph has 1433 vertices with 5940 edges in total. The vertex degrees ranged from 1 to 76, with a mean of 8.3 and a median of 5. We used the 1433 gene expressions as predictor variables to predict the expression level of TRIM32.
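A minimal sketch of the variance-based filtering step described above, selecting the 2000 most variable probes before merging the pathway information; the file name and probe identifier are placeholders, not part of the original study.

```python
import pandas as pd

# Placeholder inputs: a samples x probes expression matrix and the TRIM32 probe id.
expr = pd.read_csv("rat_eye_expression.csv", index_col=0)
y = expr.pop("TRIM32_probe_id")  # response: expression level of TRIM32

# Keep the 2000 probes with the largest variance across samples.
top_probes = expr.var(axis=0).nlargest(2000).index
X = expr[top_probes]
```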

To build predictive models, we randomly split the set of size n = 120 into a training set of n = 96 and a testing set of n = 24. Five-fold cross-validation was conducted on the training set to determine the hyperparameters of the models. Once the hyperparameters were fixed, the final models were trained on the entire training set. Table 9 shows the results in terms of the predictive RMSE measured on the testing set, the number of selected variables, and the number of edges associated with the selected variables. The proposed method produced a slightly smaller prediction error than the other methods. The graph regularization methods other than the proposed method showed worse prediction errors than Lasso, Enet, and Subset. This result might have arisen from the assumption that the graph regularization methods share, namely that the regression coefficients of variables adjacent in the graph have similar absolute values. This assumption does not appear to hold for this data set, and thus the methods based on it showed unsatisfactory predictive performance. In contrast, our method relies on a less strict assumption to utilize the graph information and produced a better predictive model. The subnetworks identified by the Grass are presented in Figure 6. They involve several pathways, including metabolic pathways, pathways in cancer, the MAPK signaling pathway, endocytosis, the calcium signaling pathway, renal cell carcinoma, hepatitis C, salivary secretion, aldosterone-regulated sodium reabsorption, and vascular smooth muscle contraction.

6. Conclusions

In this study, we examined a variable subset selection and regression coefficient estimation problem in linear regression that incorporates a graph structure of the predictor variables. We proposed the Grass, which contains two constraints for two different objectives: the cardinality constraint that controls the number of selected variables and the graph structured subset constraint that encourages the predictor variables adjacent in the given graph to be simultaneously selected or eliminated from the model. Moreover, we developed an efficient discrete projected gradient descent algorithm to find a good local optimal solution of the Grass. Through simulations and real-world case studies, we compared the performance of the proposed Grass with that of the existing graph regularization methods. The results showed that our method sparsely and correctly identifies the significant predictor variables with comparable predictive and regression coefficient estimation performance. Moreover, the proposed method is more robust to partially incorrect graphs, which cannot be avoided in real-world cases, than the other graph regularization methods. Owing to this advantage, the Grass yielded better predictive performance in both the simulation and case studies than the existing graph regularization methods. There are several interesting directions for future research. One is to extend our formulation to more general graphs.

Although we only consider unweighted and undirected graphs in this study, the proposed method can be extended to address more general graphs, including weighted or directed graphs. Second, our method can be extended to more general supervised learning tasks such as classification or survival analysis. In terms of computational aspects, the linearized graph structured projection itself is an interesting optimization problem: it is a combinatorial problem of finding subgraphs under certain constraints. Developing a more efficient computational algorithm for this projection problem is a worthwhile direction for future study.

Credit Author Statement

Hyungrok Do: Formal analysis, Investigation, Methodology, Validation, Visualization, Roles/Writing - original draft. Myun-Seok Cheon: Conceptualization, Investigation, Methodology, Writing - review & editing. Seoung Bum Kim: Funding acquisition, Investigation, Methodology, Supervision, Writing - review & editing.

Declaration of Competing Interest

We warrant that the manuscript is our original work, has not been published before, and is not currently under consideration for publication elsewhere. We confirm that there are no known conflicts of interest associated with this publication and that there has been no significant financial support for this work that could have influenced its outcome. The corresponding author and all of the authors have read and approved the final submitted manuscript.

Acknowledgements

The authors would like to thank the editor and reviewers for their useful comments and suggestions, which greatly helped in improving the quality of the paper. This research by Hyungrok Do and Seoung Bum Kim was supported by a Korea Institute for Advancement of Technology (KIAT) grant funded by the Korea Government (MOTIE) (P0008691, The Competency Development Program for Industry Specialist) and by a National Research Foundation of Korea grant funded by the Korea government (MSIT) (No. NRF-2019R1A4A1024732).

References

[1] R. Tibshirani, Regression shrinkage and selection via the lasso, Journal of the Royal Statistical Society. Series B (Methodological) (1996) 267–288.
[2] J. Fan, R. Li, Variable selection via nonconcave penalized likelihood and its oracle properties, Journal of the American Statistical Association 96 (2001) 1348–1360.
[3] C.-H. Zhang, et al., Nearly unbiased variable selection under minimax concave penalty, The Annals of Statistics 38 (2010) 894–942.
[4] X. Shen, W. Pan, Y. Zhu, Likelihood-based selection and sharp parameter estimation, Journal of the American Statistical Association 107 (2012) 223–232.
[5] L. Dicker, B. Huang, X. Lin, Variable selection and estimation with the seamless-l0 penalty, Statistica Sinica (2013) 929–962.
[6] H. Zou, T. Hastie, Regularization and variable selection via the elastic net, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67 (2005) 301–320.
[7] H. D. Bondell, B. J. Reich, Simultaneous regression shrinkage, variable selection, and supervised clustering of predictors with OSCAR, Biometrics 64 (2008) 115–123.
[8] W. Jang, J. Lim, N. A. Lazar, J. M. Loh, D. Yu, Regression shrinkage and grouping of highly correlated predictors with HORSES, arXiv preprint arXiv:1302.0256 (2013).
[9] J. Huang, P. Breheny, S. Lee, S. Ma, C.-H. Zhang, The Mnet method for variable selection, Statistica Sinica (2016) 903–923.
[10] G. Tutz, J. Ulbricht, Penalized regression with correlation-based penalty, Statistics and Computing 19 (2009) 239–253.
[11] Z. J. Daye, X. J. Jeng, Shrinkage and model selection with correlated variables via weighted fusion, Computational Statistics & Data Analysis 53 (2009) 1284–1298.

[12] J. Huang, S. Ma, H. Li, C.-H. Zhang, The sparse Laplacian shrinkage estimator for high-dimensional regression, Annals of Statistics 39 (2011) 2021.
[13] M. El Anbari, A. Mkhadri, Penalized regression combining the l1 norm and a correlation based penalty, Sankhya B 76 (2014) 82–102.
[14] M. Yuan, Y. Lin, Model selection and estimation in regression with grouped variables, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 68 (2006) 49–67.
[15] B. A. Turlach, W. N. Venables, S. J. Wright, Simultaneous variable selection, Technometrics 47 (2005) 349–363.
[16] J. E. Vogt, V. Roth, A complete analysis of the ℓ1,p group-lasso, in: Proceedings of the 29th International Conference on Machine Learning, Omnipress, pp. 1091–1098.
[17] L. Jacob, G. Obozinski, J.-P. Vert, Group lasso with overlap and graph lasso, in: Proceedings of the 26th Annual International Conference on Machine Learning, ACM, pp. 433–440.
[18] P. Zhao, G. Rocha, B. Yu, et al., The composite absolute penalties family for grouped and hierarchical variable selection, The Annals of Statistics 37 (2009) 3468–3497.
[19] R. Jenatton, J. Mairal, G. Obozinski, F. R. Bach, Proximal methods for sparse hierarchical dictionary learning, in: ICML, 2010, Citeseer, pp. 487–494.
[20] J. Liu, J. Ye, Moreau-Yosida regularization for grouped tree structure learning, in: Advances in Neural Information Processing Systems, pp. 1459–1467.
[21] C. Li, H. Li, Network-constrained regularization and variable selection for analysis of genomic data, Bioinformatics 24 (2008) 1175–1182.
[22] C. Li, H. Li, Variable selection and regression analysis for graph-structured covariates with an application to genomics, The Annals of Applied Statistics 4 (2010) 1498.


[23] S. Kim, K.-A. Sohn, E. P. Xing, A multivariate regression approach to association analysis of a quantitative trait network, Bioinformatics 25 (2009) i204–i212.
[24] R. Tibshirani, M. Saunders, S. Rosset, J. Zhu, K. Knight, Sparsity and smoothness via the fused lasso, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67 (2005) 91–108.
[25] W. Pan, B. Xie, X. Shen, Incorporating predictor network in penalized regression with application to microarray data, Biometrics 66 (2010) 474–484.
[26] S. Yang, L. Yuan, Y.-C. Lai, X. Shen, P. Wonka, J. Ye, Feature grouping and selection over an undirected graph, in: Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, pp. 922–930.
[27] X. Shen, H.-C. Huang, W. Pan, Simultaneous supervised clustering and feature selection over a graph, Biometrika 99 (2012) 899–914.
[28] S. Kim, W. Pan, X. Shen, Network-based penalized regression with application to genomic data, Biometrics 69 (2013) 582–593.
[29] D. Bertsimas, A. King, R. Mazumder, et al., Best subset selection via a modern optimization lens, The Annals of Statistics 44 (2016) 813–852.
[30] R. Miyashiro, Y. Takano, Mixed integer second-order cone programming formulations for variable selection in linear regression, European Journal of Operational Research 247 (2015) 721–731.
[31] R. Miyashiro, Y. Takano, Subset selection by Mallows' Cp: A mixed integer programming approach, Expert Systems with Applications 42 (2015) 325–331.
[32] R. Tamura, K. Kobayashi, Y. Takano, R. Miyashiro, K. Nakata, T. Matsui, Mixed integer quadratic optimization formulations for eliminating multicollinearity based on variance inflation factor, Journal of Global Optimization (2016) 1–16.
[33] R. Tamura, K. Kobayashi, Y. Takano, R. Miyashiro, K. Nakata, T. Matsui, Best subset selection for eliminating multicollinearity, Journal of the Operations Research Society of Japan 60 (2017) 321–336.

[34] R. Mazumder, P. Radchenko, The discrete Dantzig selector: Estimating sparse linear models via mixed integer linear optimization, IEEE Transactions on Information Theory 63 (2017) 3053–3075.
[35] H. Hazimeh, R. Mazumder, Fast best subset selection: Coordinate descent and local combinatorial optimization algorithms, arXiv preprint arXiv:1803.01454 (2018).
[36] Q. Yao, J. T. Kwok, More efficient accelerated proximal algorithm for nonconvex problems, arXiv preprint (2016).
[37] Q. Li, Y. Zhou, Y. Liang, P. K. Varshney, Convergence analysis of proximal gradient with momentum for nonconvex optimization, in: Proceedings of the 34th International Conference on Machine Learning - Volume 70, JMLR.org, pp. 2111–2119.
[38] R. Shang, Z. Zhang, L. Jiao, W. Wang, S. Yang, Global discriminative-based nonnegative spectral clustering, Pattern Recognition 55 (2016) 172–182.
[39] R. Shang, W. Wang, R. Stolkin, L. Jiao, Non-negative spectral learning and sparse regression-based dual-graph regularized feature selection, IEEE Transactions on Cybernetics 48 (2017) 793–806.
[40] Y. Meng, R. Shang, L. Jiao, W. Zhang, Y. Yuan, S. Yang, Feature selection based dual-graph sparse non-negative matrix factorization for local discriminative clustering, Neurocomputing 290 (2018) 87–99.
[41] Y. Meng, R. Shang, L. Jiao, W. Zhang, S. Yang, Dual-graph regularized non-negative matrix factorization with sparse and orthogonal constraints, Engineering Applications of Artificial Intelligence 69 (2018) 24–35.
[42] Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course, volume 87, Springer Science & Business Media, 2013.
[43] F. Tai, W. Pan, X. Shen, Bayesian variable selection in regression with networked predictors, in: High-Dimensional Data Analysis, World Scientific, 2011, pp. 147–165.


[44] J. Friedman, T. Hastie, R. Tibshirani, Regularization paths for generalized linear models via coordinate descent, Journal of Statistical Software 33 (2010) 1.
[45] S. Horvath, B. Zhang, M. Carlson, K. Lu, S. Zhu, R. Felciano, M. Laurance, W. Zhao, S. Qi, Z. Chen, et al., Analysis of oncogenic signaling networks in glioblastoma identifies ASPM as a molecular target, Proceedings of the National Academy of Sciences 103 (2006) 17402–17407.
[46] M. Carlson, hgu133a.db: Affymetrix Human Genome U133 Set annotation data (chip hgu133a), 2016. R package version 3.2.3.
[47] T. E. Scheetz, K.-Y. A. Kim, R. E. Swiderski, A. R. Philp, T. A. Braun, K. L. Knudtson, A. M. Dorrance, G. F. DiBona, J. Huang, T. L. Casavant, et al., Regulation of gene expression in the mammalian eye and its relevance to eye disease, Proceedings of the National Academy of Sciences 103 (2006) 14429–14434.
[48] A. P. Chiang, J. S. Beck, H.-J. Yen, M. K. Tayeh, T. E. Scheetz, R. E. Swiderski, D. Y. Nishimura, T. A. Braun, K.-Y. A. Kim, J. Huang, et al., Homozygosity mapping with SNP arrays identifies TRIM32, an E3 ubiquitin ligase, as a Bardet–Biedl syndrome gene (BBS11), Proceedings of the National Academy of Sciences 103 (2006) 6287–6292.
[49] M. Carlson, rat2302.db: Affymetrix Rat Genome 230 2.0 Array annotation data (chip rat2302), 2016. R package version 3.2.3.
