
Modeling of various biological networks via LCMARS

Ezgi Ayyıldız, Vilda Purutçuoğlu∗

Department of Statistics, Middle East Technical University, Ankara, Turkey

Abstract

In systems biology, the interactions between components such as genes and proteins can be represented by a network. To understand the molecular mechanism of complex biological systems, the construction of their networks plays a crucial role. However, the estimation of these biological networks is a challenging problem because of their high dimensional and sparse structures. Several statistical methods have been proposed to overcome this issue. The Conic Multivariate Adaptive Regression Splines (CMARS) method is one of the recent nonparametric methods developed for high dimensional and correlated data. This model is suggested to improve the performance of the Multivariate Adaptive Regression Spline (MARS) approach, which is a complex model under the generalized additive models. Previous studies have shown that MARS can be a promising model for the description of steady-state activations of biological networks if it is modified as a lasso-type regression via the main effects. In this study, we convert the full description of CMARS into a loop-based approach, so-called LCMARS, by including both main and second-order interaction effects, since this description has performed better on benchmark real datasets. Here, we generate various scenarios based on distinct distributions and dimensions to compare the performance of LCMARS with MARS and the Gaussian Graphical Model (GGM) in terms of accuracy measures via Monte Carlo runs. Additionally, different real biological datasets are used to observe the performance of the underlying methods.

Keywords: Conic multivariate adaptive regression splines, Gaussian graphical model, sparse networks

∗ Corresponding author. Email address: [email protected] (Vilda Purutçuoğlu)

Figure 1: Simple representation of a network with 1000 nodes.

1. Introduction

The Gaussian Graphical Model (GGM) is the most widely used parametric and undirected graphical model; it is based on the multivariate normality assumption. This model constructs the graph structure by using the conditional independence between nodes [1]. Under normality, conditional independence means that there is no relation between the corresponding nodes, which is represented by a zero entry in the precision matrix. In the inference of GGM, the lasso regression is applied in such a way that each node is regressed against all remaining nodes simultaneously, and the precision matrix is estimated by using the significant regression coefficients [1]. However, for large systems under high sparsity, the inference of GGM is computationally demanding. Because of this challenge and the strict normality assumption, a nonparametric model, the so-called Multivariate Adaptive Regression Spline (MARS), has been proposed as an alternative approach to GGM [2]. In the study of Ayyıldız et al. [2], MARS was applied with only the main terms, and the performance of both MARS and GGM was compared in terms of model accuracy and computational cost. The analyses showed that MARS results in higher accuracy and achieves a large gain in computational time, specifically under large systems. On the other hand, comparative analyses have shown that CMARS outperforms MARS on various datasets.

Hereby, in this study, we convert the full description of CMARS into a loop-based approach, so-called LCMARS, similar to the MARS application in biological networks. Different from that model structure, we include both main and second-order interaction effects, since this description has performed better on benchmark real datasets and the second-order interactions can represent the feed-forward loop in biological networks [3, 4]. In Fig. 1, a simple representation of a network with 1000 nodes is given as an example of biological networks. In this study, we compare LCMARS with MARS and GGM under both simulated and real datasets. We generate various scenarios and evaluate the performance of all three methods in terms of accuracy measures via Monte Carlo runs. Accordingly, the rest of the paper is organized as follows. In Section 2, GGM, MARS and CMARS are explained. Then, we present the application and the results of the comparative analyses in Section 3. Finally, we conclude the findings in Section 4.

2. Methods

Gaussian Graphical Model


The Gaussian Graphical Model (GGM) is a widely used undirected graphical model whose states are described by the multivariate normal distribution. In GGM, the structure of the graph is constructed by using the conditional independencies of the variables under the normality assumption. Therefore, if two nodes, i.e., variables, in the graph are conditionally independent, no edge is drawn between these nodes and the associated entry in the precision matrix, which is the inverse of the variance-covariance matrix, is set to zero [1].

In GGM, a regression model is built for each node against all remaining nodes as described below [5]:

Y^{(p)} = \beta Y^{(-p)} + \varepsilon,   (1)

where Y^{(p)} and Y^{(-p)} denote the state of the pth node and the states of all nodes except the pth node, respectively. In this model, Y^{(p)} and Y^{(j)} (j = 1, 2, \ldots, p-1) are conditionally independent when the associated regression coefficient \beta_j equals zero.
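As a concrete illustration of Eq. (1), each node-wise regression can be fitted as a lasso problem. The following sketch is only a minimal neighbourhood-selection example; it assumes an n × p expression matrix Y and the glmnet package, and it is not the exact estimator used later in the paper (Section 3 relies on the graphical lasso instead).

```r
# Minimal sketch of the node-wise lasso regressions in Eq. (1), assuming an
# n x p data matrix Y (rows: observations, columns: nodes) and the glmnet package.
library(glmnet)

neighbourhood_lasso <- function(Y) {
  p <- ncol(Y)
  adj <- matrix(0, p, p)                                        # estimated adjacency matrix
  for (node in seq_len(p)) {
    fit  <- cv.glmnet(x = Y[, -node], y = Y[, node], alpha = 1) # lasso of node on all others
    beta <- as.matrix(coef(fit, s = "lambda.min"))[-1, 1]       # drop the intercept
    adj[node, seq_len(p)[-node]] <- as.numeric(beta != 0)       # nonzero coefficient -> edge
  }
  adj * t(adj)                                                  # AND rule for a symmetric graph
}
```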

Multivariate Adaptive Regression Spline

Multivariate Adaptive Regression Spline (MARS) is a nonparametric regression method which is suitable for high-dimensional and correlated data under nonlinearity [6]. It can be seen as a generalization of the stepwise linear regression model [7]. In MARS, piecewise linear basis functions (BFs) are used to obtain a model of the following form [6]:

f(y) = \beta_0 + \sum_{m=1}^{M} \beta_m \psi_m(y),   (2)

in which each \psi_m(y) is a function in a set of basis functions and M shows the number of basis functions. Accordingly, the piecewise linear basis functions (y-t)_+ and (t-y)_+ are defined as follows:

(y-t)_+ = \begin{cases} y-t, & \text{if } y > t \\ 0, & \text{otherwise} \end{cases} \qquad \text{and} \qquad (t-y)_+ = \begin{cases} t-y, & \text{if } y < t \\ 0, & \text{otherwise.} \end{cases}   (3)

In Eq. (3), the (+) sign indicates the positive part of the expression and t is called the knot. The functions (y-t)_+ and (t-y)_+ are able to smooth a curve by pieces of linear expressions, as shown in Fig. 2.

Figure 2: Simple representation of the smoothing method for the curvature structure via basis functions of MARS.

In the computation of the MARS model, the method has a two-stage procedure, namely, the forward stage and the backward stage. In the forward stage, the basis functions are iteratively added to the model, which initially has a sole intercept term. But the final model at the end of this stage typically overfits the data [8]. To solve this challenge, the backward stage is applied by removing the term whose elimination causes the smallest increase in the residual squared error. In order to get the best estimated model with an optimal number of terms, λ, the generalized cross-validation value (GCV) is calculated [6, 7]. The mathematical expression of GCV is presented below:

\mathrm{GCV}(\lambda) = \frac{\sum_{i=1}^{N} \big(y_i - \hat{f}_{\lambda}(y_i)\big)^2}{\big(1 - M(\lambda)/N\big)^2}.   (4)

In Eq. (4), \hat{f}_{\lambda} is the best estimated model with λ terms, M(\lambda) stands for the effective number of parameters and N represents the number of observations. Here, M(\lambda) can be found via M(\lambda) = r + cK, where r denotes the number of linearly independent basis functions and K describes the number of knots selected during the forward stage. Finally, c is the cost for the optimization of the basis functions and the smoothing parameter of the model. This value is typically equated to c = 3, whereas, if the model is restricted to an additive model, it is set to c = 2 [9]. Hence, the model which minimizes GCV is chosen as the final model in the backward stage.
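To see the two stages in practice, one option is the open-source earth package in R, which implements the forward/backward MARS procedure and reports a GCV score in the spirit of Eq. (4). The sketch below is only illustrative; X (an n × q predictor matrix) and y (a numeric response) are assumed placeholders.

```r
# Illustrative MARS fit with the 'earth' package (an open-source MARS implementation).
# X is an assumed n x q predictor matrix and y a numeric response vector.
library(earth)

fit <- earth(x = X, y = y,
             degree  = 2,   # allow second-order interactions between basis functions
             penalty = 3)   # per-knot cost, analogous to c in the GCV of Eq. (4)

summary(fit)                # selected basis functions, coefficients and the GCV value
evimp(fit)                  # predictors that survive the backward (pruning) stage
```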

Conic Multivariate Adaptive Regression Spline

Conic Multivariate Adaptive Regression Spline (CMARS) is a modified version of MARS, developed to overcome the problem of overfitting in MARS [10]. Thus, once a complex model is found in the forward stage of MARS, the penalized residual sum of squares (PRSS) presented in Eq. (5) is computed for the parameter estimation in CMARS. PRSS is described as below [11]:

\mathrm{PRSS} = \sum_{i=1}^{N} \big(y_i - f(\bar{x}_i)\big)^2 + \sum_{m=1}^{M_{\max}} \lambda_m \sum_{|\alpha|=1}^{2} \; \sum_{\substack{r<s \\ r,s \in V(m)}} \int_{Q^m} \theta_m^2 \big[D^{\alpha}_{r,s}\, \psi_m(t^m)\big]^2 \, dt^m.   (5)

In this expression, λ_m stands for the penalty parameters, V(m) is the variable set associated with the mth basis function ψ_m, t^m denotes the vector of variables which contribute to the mth basis function and thus shows the integration variable of the complexity term, and finally, Q^m denotes a sufficiently large parallelepiped where the integration takes place, i.e., where the input data are located. The optimization problem in Eq. (5), which controls both the accuracy and the complexity of the model, can also be rearranged as a Tikhonov regularization (TR) problem of the following form:

\mathrm{PRSS} \approx \|y - \psi(\bar{d})\,\theta\|_2^2 + \lambda \|L\theta\|_2^2.   (6)

In Eq. (6), L implies a diagonal (M_{\max}+1) \times (M_{\max}+1) matrix, θ refers to an ((M_{\max}+1) \times 1)-dimensional vector of parameters, and \|\cdot\|_2^2 denotes the squared L_2-norm of the given term. In this way, conic quadratic programming can be performed to solve the regularization problem in Eq. (5).
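CMARS itself solves Eq. (6) by conic quadratic programming, but the effect of the Tikhonov term can be illustrated with the simpler closed-form ridge-type solution sketched below; the basis matrix psi, the penalty matrix L and the constant lambda are all assumed to be given, so this is a didactic simplification rather than the CQP solver of CMARS.

```r
# Didactic sketch of the Tikhonov-regularized problem in Eq. (6):
# minimise ||y - psi %*% theta||^2 + lambda * ||L %*% theta||^2 over theta.
# psi: N x (Mmax+1) basis matrix, L: diagonal penalty matrix, lambda: fixed constant
# (all assumed available; CMARS instead solves this via conic quadratic programming).
tikhonov_theta <- function(psi, y, L, lambda) {
  solve(crossprod(psi) + lambda * crossprod(L), crossprod(psi, y))
}
```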



In this study, we construct the loop-based CMARS model similar to the regression model given in Eq. (1), and call it loop-based CMARS, shortly LCMARS. Thereby, we regress each node in the graph against all the remaining nodes iteratively and estimate the regression coefficients to detect the relation between the corresponding nodes. In this calculation, the steps of the LCMARS method can be listed as below (a compact sketch of this loop is given after the list):

(1) Construct p models which include both main and second-order interaction effects by using the forward stage of MARS.

(2) Compute PRSS for each model without moving to the backward stage of MARS.

(3) Determine the best model which has the optimal penalty term.

(4) For each selected LCMARS model, assign 1 to the associated entry of the adjacency matrix if there is a relation between the response and the explanatory variables.

(5) Apply the AND rule to build a symmetric adjacency matrix. Here, the AND rule means that we assign 1 for each pair of nodes (i, j) in the precision matrix if both the (i, j) and (j, i) entries in Θ indicate edges.

(6) Compare the estimated adjacency matrix with the true binary precision matrix to compute the accuracy measures, which we choose as recall, F-measure, Jaccard index and accuracy.

In the application, we perform LCMARS, MARS and GGM methods to

120

construct the networks of simulated and real biological datasets. We implement LCMARS by using the main effects and the second-order interactions of the model. The significant regression coefficients of this model are directly taken from the estimation of the adjacency matrix which is used to indicate the undirected link between nodes in a network. Here, the estimated adjacency matrix

125

can be nonsymmetric since the response variables in each regression model have different significant explanatory variables. So we apply the AND rule which means that we obtain an entry 1 when both (i, j) and (j, i) entries of the estimated adjacency matrix are one. By using the AND rule, we infer more sparse 6

Page 6 of 22

matrices similar to the real biological networks. In the end, we compare the 130

symmetric adjacency matrix with the true network and evaluate the different

ip t

accuracy measures such as the recall, F-measure, Jaccard index and accuracy.

equations.

F-measure

=

Jaccard index

=

Accuracy

=

TP , TP+FN 2TP , 2TP + FP + FN TP , TP+FP+FN TP+TN . TP+TN+FP+FN

us

=

an

Recall

cr

The mathematical description of each measure is presented in the following

(7) (8) (9)

(10)

135

M

In these expressions, the true positive (TP) implies the number of correctly classified objects that have positive labels, the true negative (TN) indicates the number of correctly classified objects that have negative labels, the false positive

d

(FP) shows the number of misclassified objects that have negative labels, and

te

the false negative (FN) presents the number of misclassified objects that have positive labels. Here, recall represents the ratio of correctly estimated positive links (TP) over the total number of links in the true network. In other words,

Ac ce p

140

it only deals with the number of TP and does not pay attention to the number of TN. On the contrary, the accuracy is the ratio of correctly classified genes in both labels to the total of all classified genes. So it evaluates the performance of methods from both sides. Additionally, in all these measures, 1 indicates

145

perfect accuracies, whereas, zero implies no success at all in terms of accuracy. 3.1. Application of Simulated Study The simulated datasets are created under mainly two different scenarios such that the generated matrices come from normal and non-normal distributions. The multivariate normally distributed datasets are generated under the scale-
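Assuming the estimated and true networks are stored as binary adjacency matrices of the same dimension, the four measures in Eqs. (7)-(10) can be computed from the entries above the diagonal as in the short sketch below.

```r
# Sketch of the accuracy measures in Eqs. (7)-(10) for two binary adjacency matrices
# (1 = edge, 0 = no edge); est is the estimated network and truth the true network.
network_scores <- function(est, truth) {
  e  <- est[upper.tri(est)]        # undirected graphs: count each pair (i, j) once
  tr <- truth[upper.tri(truth)]
  TP <- sum(e == 1 & tr == 1); TN <- sum(e == 0 & tr == 0)
  FP <- sum(e == 1 & tr == 0); FN <- sum(e == 0 & tr == 1)
  c(recall    = TP / (TP + FN),
    F.measure = 2 * TP / (2 * TP + FP + FN),
    Jaccard   = TP / (TP + FP + FN),
    accuracy  = (TP + TN) / (TP + TN + FP + FN))
}
```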


3.1. Application of Simulated Study

The simulated datasets are created under mainly two different scenarios, such that the generated matrices come from normal and non-normal distributions. The multivariate normally distributed datasets are generated under the scale-free feature by using the huge package in the R programming language. For non-normal datasets, we use different distributions, namely Student-t, log-normal, Cauchy, and a mixture of the log-normal and normal distributions. For the Student-t distribution, we use two different degrees of freedom (df), namely df=7 and df=15, so that we can detect the results of the Student-t and of its more normally distributed version, respectively. The outputs of the LCMARS, MARS and GGM methods under all scenarios are presented in Tables 1-5.

Furthermore, for all schemes, the datasets are simulated under various dimensions p, listed as 40, 100 and 200 genes in the system. Moreover, the number of observations for each dataset equals 20 and all the results are based on 1000 Monte Carlo runs. Finally, in all scenarios, the topology of the networks is chosen as scale-free, since this is the most common view of biological networks [12].
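For the normal scenario, a minimal data-generation sketch with the huge package could look as follows; the values of n, p and the scale-free topology mirror the setting described above, while the non-normal scenarios would additionally transform the generated variables.

```r
# Sketch of one normally distributed, scale-free replicate generated with 'huge'.
library(huge)

set.seed(1)
sim   <- huge.generator(n = 20, d = 100, graph = "scale-free")  # 20 observations, 100 genes
Y     <- sim$data                                               # simulated expression matrix
truth <- as.matrix(sim$theta)                                   # true adjacency for the accuracy measures
```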

On the other hand, we perform the graphical lasso (glasso) method [9] to infer the model parameters in GGM. This method basically finds the entries of the precision matrix via the penalized likelihood function whose penalty constant is controlled by the L1-norm. In this study, the optimal penalty constant is selected by the StARS criterion [13].
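A corresponding sketch of the GGM fit, again with the huge package, estimates the precision matrix by the graphical lasso and picks the penalty constant with the StARS criterion; Y is assumed to be the simulated data matrix from the previous sketch.

```r
# Sketch of the graphical lasso fit with StARS model selection via 'huge'.
library(huge)

glasso_path <- huge(Y, method = "glasso")                     # glasso solution path over penalties
glasso_sel  <- huge.select(glasso_path, criterion = "stars")  # StARS choice of the penalty constant
adj_ggm     <- as.matrix(glasso_sel$refit)                    # selected (binary) adjacency matrix
```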


From the analyses, we observe that, for the normal distribution, the recall and accuracy values of both methods are close to each other under all dimensions. On the other side, the F-measure and Jaccard index values of LCMARS are greater than those of GGM. Additionally, as seen in Table 1, when the dimension of the data increases, the differences between the methods get larger.

Furthermore, as observed in Table 2, under the Student-t distribution for all dimensions and both degrees of freedom (df=7 and df=15), the F-measure, Jaccard index and accuracy values of LCMARS are higher than those of both MARS and GGM. Similar to the previous result, GGM gets worse when the dimension increases.

In Table 3, we observe that the accuracy, F-measure and Jaccard index values of LCMARS under the log-normal datasets are slightly higher than those of GGM when the dimension of the network is small, i.e., p = 40, whereas the difference between LCMARS and GGM grows significantly for larger dimensions. Besides these, the recall values of GGM are higher than those of LCMARS for high dimensional systems.

Table 1: Comparison of the recall, F-measure, Jaccard index and accuracy for normally distributed data. p is the total number of genes in the system. The best models are shown in boldface.

Method    p     Recall   F-measure   Jaccard index   Accuracy
LCMARS    40    0.3483   0.4608      0.2995          0.9400
LCMARS    100   0.3392   0.4492      0.2896          0.9752
LCMARS    200   0.3359   0.4455      0.2869          0.9876
GGM       40    0.3539   0.4335      0.2773          0.9309
GGM       100   0.3462   0.3773      0.2330          0.9655
GGM       200   0.3426   0.3264      0.1958          0.9782
MARS      40    0.3502   0.4505      0.2910          0.9369
MARS      100   0.3388   0.4556      0.2949          0.9759
MARS      200   0.3571   0.1990      0.1105          0.9570

Moreover, the accuracy, F-measure and Jaccard index values of MARS are close to those of LCMARS under small and moderate systems, i.e., p = 40 and 100, respectively. On the other side, the difference between LCMARS and MARS gets larger under bigger systems, specifically when we compare the F-measure and Jaccard index.

Finally, the results for the Cauchy distribution and for the mixture of the log-normal and normal distributions are similar to the previous results, as can be seen in Table 4 and Table 5, respectively. In other words, LCMARS produces higher scores than MARS and GGM in terms of the accuracy, F-measure and Jaccard index. Furthermore, although the recall values of GGM are higher than those of LCMARS and MARS, the differences get smaller when the dimension increases. Hence, from all findings we detect that, apart from small systems, LCMARS performs better than both MARS and GGM under all scenarios and dimensions. Additionally, when we compare the results of LCMARS and MARS, it can be observed that although the results are close to each other under small networks, LCMARS stands out from MARS for large networks.

Table 2: Comparison of the recall, F-measure, Jaccard index and accuracy for Student-t distributed data with degrees of freedom (df) 7 and 15. p is the total number of genes in the system. The best models are shown in boldface.

df    Method    p     Recall   F-measure   Jaccard index   Accuracy
7     LCMARS    40    0.3379   0.4468      0.2882          0.9374
7     LCMARS    100   0.3346   0.4425      0.2844          0.9747
7     LCMARS    200   0.3339   0.4426      0.2846          0.9873
7     GGM       40    0.3895   0.3117      0.1853          0.8676
7     GGM       100   0.3780   0.2064      0.1155          0.9090
7     GGM       200   0.3744   0.1335      0.0718          0.9238
7     MARS      40    0.3422   0.4358      0.2790          0.9335
7     MARS      100   0.3349   0.4403      0.2826          0.9745
7     MARS      200   0.3540   0.1996      0.1109          0.9574
15    LCMARS    40    0.3382   0.4468      0.2883          0.9374
15    LCMARS    100   0.3347   0.4425      0.2844          0.9747
15    LCMARS    200   0.3341   0.4428      0.2846          0.9873
15    GGM       40    0.3972   0.2971      0.1749          0.8564
15    GGM       100   0.3833   0.1909      0.1058          0.8998
15    GGM       200   0.3792   0.1249      0.0668          0.9179
15    MARS      40    0.3415   0.4353      0.2786          0.9335
15    MARS      100   0.3351   0.4409      0.2830          0.9745
15    MARS      200   0.3544   0.1997      0.1109          0.9574

Therefore, we consider that the suggested approach can be a promising alternative to GGM in the construction of complex biological networks. In Fig. 4, we show the number of successes for each model based on each accuracy measure. The plot also indicates that LCMARS is more accurate than the other models for all selected measures.

Table 3: Comparison of the recall, F-measure, Jaccard index and accuracy for log-normally distributed data. p is the total number of genes in the system. The best models are shown in boldface.

Method    p     Recall   F-measure   Jaccard index   Accuracy
LCMARS    40    0.3559   0.4615      0.3000          0.9388
LCMARS    100   0.3391   0.4534      0.2931          0.9756
LCMARS    200   0.3357   0.4565      0.2961          0.9880
GGM       40    0.3390   0.5063      0.3390          0.9512
GGM       100   0.4664   0.1198      0.0637          0.7950
GGM       200   0.4215   0.0851      0.0444          0.8643
MARS      40    0.3390   0.4600      0.2990          0.9410
MARS      100   0.3397   0.4424      0.2840          0.9745
MARS      200   0.3586   0.1918      0.1061          0.9548

Table 4: Comparison of the recall, F-measure, Jaccard index and accuracy for Cauchy distributed data. p is the total number of genes in the system. The best models are shown in boldface.

Method    p     Recall   F-measure   Jaccard index   Accuracy
LCMARS    40    0.3487   0.4563      0.2958          0.9388
LCMARS    100   0.3385   0.4588      0.2976          0.9762
LCMARS    200   0.3351   0.4694      0.3070          0.9889
GGM       40    0.7413   0.1585      0.0861          0.4191
GGM       100   0.5532   0.0906      0.0475          0.6689
GGM       200   0.4468   0.0708      0.0367          0.8243
MARS      40    0.3552   0.4295      0.2737          0.9303
MARS      100   0.3395   0.4422      0.2839          0.9745
MARS      200   0.3570   0.2038      0.1135          0.9583

Table 5: Comparison of the recall, F-measure, Jaccard index and accuracy for the mixture of log-normal and normal data. p is the total number of genes in the system. The best models are shown in boldface.

Method    p     Recall   F-measure   Jaccard index   Accuracy
LCMARS    40    0.3471   0.4621      0.3005          0.9405
LCMARS    100   0.3388   0.4586      0.2974          0.9762
LCMARS    200   0.3355   0.4593      0.2984          0.9881
GGM       40    0.4593   0.2505      0.1435          0.7963
GGM       100   0.4112   0.1631      0.0888          0.8739
GGM       200   0.3816   0.1264      0.0675          0.9211
MARS      40    0.3548   0.4322      0.2759          0.9312
MARS      100   0.3403   0.4370      0.2796          0.9739
MARS      200   0.3564   0.2047      0.1140          0.9586

3.2. Application of Biological Data

To evaluate the performance of the methods under biological data, we use different synthetic and real benchmark datasets. In this application, the synthetic data are taken from benchmark synthetic data tools such as SynTReN (Synthetic Transcriptional Regulatory Networks) [14] and GeneNetWeaver (GNW) [15]. The synthetic gene expression datasets from these tools approximate experimental data and enable us to validate different network construction methods, as they also provide the true network model.

Accordingly, in this study, the gene expression datasets from the DREAM 4 (2009) in-silico size 10 and size 100 challenges are used. DREAM is the international Dialogue for Reverse Engineering Assessments and Methods competition, which publishes network inference challenges. Hereby, the DREAM 4 in-silico size 10 challenge consists of five networks involving 10 genes. Moreover, the in-silico size 100 challenge includes five networks involving 100 genes. For each network, multifactorial data are available which contain steady-state levels of variations in the network based on the gene expression characteristics of two well-studied systems, namely E. coli and S. cerevisiae. These datasets are accessible from the R package DREAM4. Hence, in our analyses, we use three datasets from in-silico size 10 and two datasets from in-silico size 100. The results of the in-silico size 10 and size 100 networks are presented in Table 6 and Table 7, respectively. From the findings, it is observed that LCMARS has better scores than MARS and GGM in terms of recall, F-measure and Jaccard index for two of the in-silico size 10 datasets, whereas the accuracy values of MARS are higher than those of the other methods for these two datasets. Additionally, for the third dataset, LCMARS is better under all accuracy measures. On the other hand, for the in-silico size 100 datasets, the accuracy results of the LCMARS and MARS methods are very close to each other. However, the GGM method is not applicable here, since it cannot work when the system's elements have high correlation.

Table 6: Comparison of the recall, F-measure, Jaccard index and accuracy for DREAM-4 in-silico size 10. The best models are shown in boldface.

Data   Method    Recall   F-measure   Jaccard index   Accuracy
1      LCMARS    0.647    0.647       0.478           0.760
1      MARS      0.471    0.616       0.444           0.800
1      GGM       0.294    0.454       0.294           0.760
2      LCMARS    0.474    0.563       0.391           0.720
2      MARS      0.368    0.538       0.368           0.760
2      GGM       0.263    0.416       0.263           0.720
3      LCMARS    0.438    0.583       0.412           0.800
3      MARS      0.375    0.522       0.353           0.780
3      GGM       0.312    0.476       0.312           0.780

Furthermore, we compare the performance of the three methods with a simulated toy example of gene expression data generated by the GNW generator, using the known biological interaction networks of E. coli. The toy data contain 64 samples and 64 genes and are available in the R package grndata. We present the outcomes of this dataset in Table 8. From the outcomes, we observe that the recall, F-measure and Jaccard index of LCMARS are better than those of the other methods. On the other side, the accuracy value of GGM is slightly higher than that of LCMARS and is very close to the MARS result.

Table 7: Comparison of the recall, F-measure, Jaccard index and accuracy for DREAM-4 in-silico size 100. The best models are shown in boldface.

Data   Method    Recall   F-measure   Jaccard index   Accuracy
1      LCMARS    0.275    0.296       0.174           0.943
1      MARS      0.248    0.298       0.175           0.934
2      LCMARS    0.234    0.267       0.154           0.939
2      MARS      0.207    0.258       0.148           0.949

Table 8: Comparison of the recall, F-measure, Jaccard index and accuracy for the toy data and E-GEOD-9891. The best models are shown in boldface.

Data           Method    Recall   F-measure   Jaccard index   Accuracy
Toy            LCMARS    0.360    0.347       0.210           0.842
Toy            MARS      0.234    0.342       0.207           0.895
Toy            GGM       0.134    0.236       0.134           0.899
E-GEOD-9891    LCMARS    0.703    0.825       0.703           0.703
E-GEOD-9891    MARS      0.306    0.469       0.306           0.306
E-GEOD-9891    GGM       0.091    0.167       0.091           0.091

Finally, as a real biological dataset, we apply the transcription profiling of 285 human ovarian tumour samples named E-GEOD-9891. The data are a cohort of 285 patients with epithelial ovarian, primary peritoneal, or fallopian tube cancer, diagnosed between 1992 and 2006. The samples were collected through the Australian Ovarian Cancer Study (n=206), Royal Brisbane Hospital (n=22), Westmead Hospital (n=54) and the Netherlands Cancer Institute (n=3) [16]. In this dataset, there are 11 target genes which are chosen to identify the novel molecular subtypes of ovarian cancer by gene expression profiling with a linkage to clinical and pathological features [16].

Figure 3: (a) True network of the E-GEOD-9891 data, (b) estimated network via LCMARS, (c) estimated network via MARS and (d) estimated network via GGM.

Figure 4: Bar charts of the number of successful models for simulated (dark blocks) and real (light blocks) datasets based on recall, accuracy, F-measure and Jaccard index.

According to the results in Table 8, it is seen that LCMARS performs significantly better than the other methods in terms of all measures. Thereby, as shown in Fig. 3, LCMARS can detect 37 links out of the 55 links of the true network, whereas MARS can only capture 13 of the same 55 links. On the other side, GGM cannot detect any link, i.e., edge between genes. We summarize the performance of all models in Fig. 4, similar to the simulated data analyses, in order to count the success of each model.

4. Conclusion

In this study, we have proposed CMARS, a nonparametric regression model, as an alternative to GGM for the construction of biological networks, because previous studies [2] have shown that the performance of MARS, which is also suggested as a competitor of GGM and is the basic modelling approach behind CMARS, is better than that of GGM in terms of accuracy and computational time. Here, we have implemented the loop-based CMARS, called LCMARS, by including main and second-order interaction effects in the regression model. To compare the performance of LCMARS with MARS and GGM, we have conducted simulation studies under distinct distributions and dimensions. Also, we have applied different synthetic and real biological benchmark datasets. From the comparative analyses, we have seen that LCMARS has higher accuracy than both MARS and GGM under both simulated and real datasets with non-normal and normal distributions. Furthermore, in the real data, LCMARS stands out from the MARS and GGM methods. Therefore, we have concluded that LCMARS can be a strong alternative to GGM, since LCMARS overcomes the limitations of GGM such as the strict normality assumption and low accuracy under large systems.

As future work, we consider adapting this approach to the robust version of CMARS, the so-called RCMARS [17]. The RCMARS model is suggested to detect the model which has the lowest false positive rate [18], resulting in the minimum number of links in the system. We think that this new model can be converted into a lasso-type model, similar to LCMARS, in order to construct the major core subnetworks. Furthermore, we consider that the description of the spline functions in the MARS, CMARS and RCMARS models can be extended by covering not only the linear relation, but also the non-linear relationship between genes. For this purpose, we plan to apply the p-splines approach [19], which uses unspecified smoothing functions in modelling.

Acknowledgement

The authors thank the BAP project (no: BAP-01-09-2016-002) at Middle East Technical University for its support. They also thank Prof. Dr. Gerhard Wilhelm Weber for his valuable discussions at different stages of the study. Finally, they thank the anonymous referee and editor for their valuable comments, which improved the quality of the paper.

Ac ce p

which improve the quality of the paper.

References

[1] J. Whittaker, Graphical Models in Applied Multivariate Statistics, John

290

Wiley and Sons, New York, 1990.

[2] E. Ayyıldız, M. A˘ graz, V. Purut¸cuo˘glu, MARS an an alternative approach of gaussian graphical model for biochemical networks, Journal of Applied Statistics 0 (2016) 1–19. doi:10.1080/02664763.2016.1266464.

[3] A. L. Barabasi, Z. N. Oltvai, Network biology: Understanding the cell’s 295

functional organization, Nature Reviews Genetics 5 (2004) 101–113. doi: 10.1038/nrg1272.

16

Page 16 of 22

[4] B. H. Junker, F. Schreiber, Analysis of biological networks, John Wiley

ip t

and Sons, Inc., Hoboken, New Jersey, 2008. [5] R. Tibshirani, Regression shrinkage and selection via the lasso, Journal of 300

the Royal Statistical Society 58 (1) (1996) 267–288.

cr

[6] J. H. Friedman, Multivariate adaptive regression splines, The Annals of

us

Statistics 19 (1) (1991) 1–67.

[7] T. Hastie, R. Tibshirani, J. H. Friedman, The elements of statistical learning: data mining, inference and prediction, Springer Verlag, New York, 2009.

an

305

[8] A. R. Barron, X. Xiao, Discussion:

multivariate adaptive regression

splines, The Annals of Statistics 19 (1) (1991) 67–82. doi:10.1214/aos/

M

1176347964.

[9] J. H. Friedman, T. Hastie, R. Tibshirani, Sparse inverse covariance estimation with the graphical lasso, Biostatistics 9 (3) (2008) 432–441.

d

310

te

doi:10.1093/biostatistics/kxm045. ˙ Batmaz, G. W. Weber, Modeling, Dynamics, Optimization [10] F. Yerlikaya, I.

Ac ce p

and Bioeconomics, Vol. 37, Springer International Publishing, 2014, Ch. A review and new contribution on conic multivariate adaptive regression

315

splines (CMARS): A powerful tool for predictive data mining, pp. 695–722.

[11] P. Taylan, G. W. Weber, F. Yerlikaya, A new approach to multivariate adaptive regression splines by using tikhonov regularization and continuous optimization, TOP 18 (2) (2010) 377–395.

doi:10.1007/

s11750-010-0155-7.

320

[12] U. Alon, An introduction to systmes biology: Design princple of biological circuits, Chapman and Hall CRC, 2007. [13] H. Liu, K. Roeder, L. Wasserman, Stability approach to regularization selection (stars) for high dimensional graphical models, Advances in Neural Information Processing Systems 23 (2010) 1432–1440. 17

Page 17 of 22

325

[14] T. V. den Bulcke, K. V. Leemput, B. Naudts, P. van Remortel, H. Ma, A. Verschoren, B. D. Moor, K. Marchal, SynTReN: a generator of syn-

ip t

thetic gene expression data for design and analysis of structure learn-

ing algorithmsan an alternative approach of Gaussian graphical model

330

cr

for biochemical networks, BMC Bioinformatics 7 (43) (2006) 1–12. doi: 10.1186/1471-2105-7-43.

us

[15] T. Schaffter, D. Marbach, D. Floreano, GeneNetWeaver: in silico benchmark generation and performance profiling of network inference methods, Bioinformatics 27 (16) (2011) 2263–2270. doi:10.1093/bioinformatics/

335

an

btr373.

[16] R. Tothill, A. Tinker, J. George, R. Brown, S. Fox, D. Johnson, M. Trivett,

M

D. Etemadmogadham, B. Locandro, N. Traficante, S. Fereday, J. Hung, Y. Chiew, I. Haviv, D. Gertig, A. deFazio, D. Bowtell, Novel molecular subtypes of serous and endometrioid ovarian cancer linked to clini-

10.1158/1078-0432.Ccr-08-0196.

te

340

d

cal outcome, Clinical Cancer Research 14 (16) (2008) 5198–5208. doi:

¨ ˙ Batmaz, The new robust CMARS (RCMARS) [17] A. Ozmen, G. W. Weber, I.

Ac ce p

method, in: ISI Proceeding of 24th MEC-EurOPT 2010- Continuous Optimization and Informatics-Based Technologies in the Financial Sector, 2010.

¨ ˙ Batmaz, E. Kropat, RCMARS: Robustifi[18] A. Ozmen, G. W. Weber, I.

345

cation of cmars with different scenarios under polyhedral uncertainty set, Communications in Nonlinear Science and Numerical Simulation 16 (12) (2011) 4780–4787. doi:10.1016/j.cnsns.2011.04.001.

[19] S. Lang, A. Brezger, Bayesian p-splines, Journal of Computational and Graphical Statistics 13 (1) (2004) 183–212. doi:10.1198/1061860043010.

18

Page 18 of 22

Highlights  


- A new alternative model for complex biological systems is proposed.
- A loop-based conic multivariate adaptive regression splines model is developed for the construction of biological networks.
- The accuracy is evaluated under various simulated and real networks.


Author Biographies


Vilda Purutçuoğlu is an associate professor in the Department of Statistics at Middle East Technical University (METU) and an affiliated faculty member in the Informatics Institute, the Institute of Applied Mathematics and the Department of Biomedical Engineering at METU. She completed her BSc and MSc in statistics with a minor degree in economics, and received her doctorate from Lancaster University. Purutçuoğlu's current research is in the fields of bioinformatics, systems biology and biostatistics. She leads a research group of doctoral and master's students working on deterministic and stochastic modelling of biological networks and their inference via Bayesian and frequentist theories.


Ezgi Ayyıldız is a doctoral student in the Department of Statistics at Middle East Technical University (METU). She completed her master's and undergraduate degrees in the same department at METU. In her doctoral thesis, she has been working on different models and the inference of biological networks under the supervision of Dr. Purutçuoğlu.
