Accepted Manuscript

Title: Modeling of various biological networks via LCMARS
Authors: Ezgi Ayyıldız, Vilda Purutçuoğlu
PII: S1877-7503(17)30448-9
DOI: https://doi.org/10.1016/j.jocs.2018.08.009
Reference: JOCS 911
To appear in: Journal of Computational Science
Received date: 23-4-2017
Revised date: 15-11-2017
Accepted date: 25-8-2018

Please cite this article as: Ezgi Ayyıldız, Vilda Purutçuoğlu, Modeling of various biological networks via LCMARS, (2018), https://doi.org/10.1016/j.jocs.2018.08.009

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
Modeling of various biological networks via LCMARS

Ezgi Ayyıldız^a, Vilda Purutçuoğlu^a,*

^a Department of Statistics, Middle East Technical University, Ankara, Turkey
Abstract
In systems biology, the interactions between components such as genes and proteins can be represented by a network. To understand the molecular mechanism of complex biological systems, the construction of their networks plays a crucial role. However, the estimation of these biological networks is a challenging problem because of their high dimensional and sparse structures. Several statistical methods have been proposed to overcome this issue. The Conic Multivariate Adaptive Regression Splines (CMARS) is one of the recent nonparametric methods developed for high dimensional and correlated data. This model is suggested to improve the performance of the Multivariate Adaptive Regression Spline (MARS) approach, which is a complex model under the generalized additive models. Previous studies have shown that MARS can be a promising model for the description of steady-state activations of biological networks if it is modified as a lasso-type regression via the main effects. In this study, we convert the full description of CMARS into a loop-based approach, so-called LCMARS, by including both main and second-order interaction effects, since this description has performed better in benchmark real datasets. Here, we generate various scenarios based on distinct distributions and dimensions to compare the performance of LCMARS with MARS and the Gaussian Graphical Model (GGM) in terms of accuracy measures via Monte Carlo runs. Additionally, different real biological datasets are used to observe the performance of the underlying methods.

Keywords: Conic multivariate adaptive regression splines, Gaussian graphical model, sparse networks

* Corresponding author. Email address: [email protected] (Vilda Purutçuoğlu)

Preprint submitted to Journal of Computational Science, September 4, 2018

Figure 1: Simple representation of a network with 1000 nodes.
1. Introduction
The Gaussian Graphical Model (GGM) is the most widely used parametric and undirected graphical model that is based on the multivariate normality assumption. This model constructs the graph structure by using the conditional independence between nodes [1]. Under normality, conditional independence means that there is no relation between the corresponding nodes, which is represented by a zero entry in the precision matrix. In inference on GGM, the lasso regression is applied in such a way that each node is regressed against all remaining nodes simultaneously, and the precision matrix is estimated by using the significant regression coefficients [1]. However, for large systems under high sparsity, the inference of GGM is computationally demanding. Because of this challenge and the strict normality assumption, a nonparametric model, the so-called Multivariate Adaptive Regression Spline (MARS), has been proposed as an alternative approach to GGM [2]. In the study of Ayyıldız et al. [2], MARS was applied with only the main terms, and the performance of both MARS and GGM was compared in terms of model accuracy and computational cost. The analyses showed that MARS results in higher accuracy and a huge gain in computational time, specifically under large systems. On the other hand, comparative analyses have shown that CMARS outperforms MARS on various datasets.

Hereby, in this study, we convert the full description of CMARS into a loop-based approach, so-called LCMARS, similar to the MARS application in biological networks. Different from this model structure, we include both main and second-order interaction effects, since this description has performed better in benchmark real datasets and the second-order interactions can be representative of the feed-forward loop in biological networks [3, 4]. In Fig. 1, a simple representation of a network with 1000 nodes is given as an example of biological networks. We then compare LCMARS with MARS and GGM under both simulated and real datasets. We generate various scenarios and evaluate the performance of all three methods in terms of accuracy measures via Monte Carlo runs. Accordingly, the rest of the paper is organized as follows. In Section 2, GGM, MARS and CMARS are explained. Then, we present the application and the results of the comparative analyses. Finally, we conclude the findings in Section 4.

2. Methods
Gaussian Graphical Model
The Gaussian Graphical Model (GGM) is a widely used undirected graphical model whose states are described by the multivariate normal distribution. In GGM, the structure of the graph is constructed by using the conditional independencies of the variables under the normality assumption. Therefore, if two nodes, i.e., variables, in the graph are conditionally independent, no edge is drawn between these nodes and the associated entry in the precision matrix, which is the inverse of the variance-covariance matrix, is set to zero [1].

In GGM, a regression model is built for each node against all remaining nodes as described below [5]:

$$Y^{(p)} = \beta Y^{(-p)} + \varepsilon, \qquad (1)$$

where $Y^{(p)}$ and $Y^{(-p)}$ denote the state of the $p$th node and the states of all nodes except the $p$th node, respectively. In this model, $Y^{(p)}$ and $Y^{(j)}$ $(j = 1, 2, \ldots, p-1)$ are conditionally independent when the associated regression coefficient $\beta_j$ equals zero.
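As an illustration of this node-wise (neighborhood-selection) regression scheme, the sketch below implements a toy coordinate-descent lasso and regresses each node on the remaining ones, marking an edge for each significant coefficient. The data, the penalty λ and the zero threshold are hypothetical choices for illustration only; the paper itself works in R:

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Toy coordinate-descent lasso for y ~ X beta (illustrative, not optimized)."""
    d = X.shape[1]
    beta = np.zeros(d)
    col_norm2 = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(d):
            r = y - X @ beta + X[:, j] * beta[j]      # partial residual
            rho = X[:, j] @ r
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_norm2[j]
    return beta

def neighborhood_selection(data, lam=5.0):
    """Regress each node on all others; nonzero coefficients give candidate edges."""
    n, p = data.shape
    adj = np.zeros((p, p), dtype=int)
    for node in range(p):
        y = data[:, node]
        X = np.delete(data, node, axis=1)
        beta = lasso_cd(X, y, lam)
        others = [k for k in range(p) if k != node]
        for b, k in zip(beta, others):
            if abs(b) > 1e-8:
                adj[node, k] = 1
    return adj

rng = np.random.default_rng(0)
# hypothetical toy data: nodes 0 and 1 strongly related, node 2 independent noise
z = rng.normal(size=100)
data = np.column_stack([z, z + 0.1 * rng.normal(size=100), rng.normal(size=100)])
adj = neighborhood_selection(data, lam=5.0)
print(adj)
```

The resulting matrix can still be asymmetric; the AND rule described later is used to symmetrize it.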
Multivariate Adaptive Regression Spline

Multivariate Adaptive Regression Spline (MARS) is a nonparametric regression method which is suitable for high-dimensional and correlated data under nonlinearity [6]. It can be seen as a generalization of stepwise linear regression [7]. In MARS, piecewise linear basis functions (BFs) are used to obtain a model which has the following form [6]:

$$f(y) = \beta_0 + \sum_{m=1}^{M} \beta_m \psi_m(y), \qquad (2)$$

in which each $\psi_m(y)$ is a function in a set of basis functions and $M$ shows the number of basis functions. Accordingly, the piecewise linear basis functions $(y-t)_+$ and $(t-y)_+$ are defined as follows:

$$(y-t)_+ = \begin{cases} y-t, & \text{if } y > t \\ 0, & \text{otherwise} \end{cases} \qquad \text{and} \qquad (t-y)_+ = \begin{cases} t-y, & \text{if } y < t \\ 0, & \text{otherwise.} \end{cases} \qquad (3)$$

In Eq. (3), the $(+)$ sign indicates the positive part of the expression and $t$ is called the knot. The functions $(y-t)_+$ and $(t-y)_+$ are able to smooth a curve by pieces of linear expressions, as shown in Fig. 2.

Figure 2: Simple representation of the smoothing method for the curvature structure via basis functions of MARS.

In the computation of the MARS model, there is a two-stage procedure, namely, the forward stage and the backward stage. In the forward stage, basis functions are iteratively added to the model, which initially has a sole intercept term. However, the final model at the end of this stage typically overfits the data [8]. To solve this challenge, the backward stage is applied by removing the term whose elimination causes the smallest increase in the residual squared error. In order to obtain the best estimated model with an optimal number of terms, $\lambda$, the generalized cross-validation value (GCV) is calculated [6, 7]. The mathematical expression of GCV is presented below:

$$\mathrm{GCV}(\lambda) = \frac{\sum_{i=1}^{N} \big(y_i - \hat{f}_{\lambda}(y_i)\big)^2}{\big(1 - M(\lambda)/N\big)^2}. \qquad (4)$$
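To make the basis functions of Eq. (3) and the criterion of Eq. (4) concrete, a small illustrative Python sketch follows; the knot t = 0 and the data are arbitrary assumptions, not values from the paper:

```python
import numpy as np

def hinge_pos(y, t):
    """(y - t)_+ : positive part of y - t, as in Eq. (3)."""
    return np.maximum(y - t, 0.0)

def hinge_neg(y, t):
    """(t - y)_+ : the mirrored hinge of Eq. (3)."""
    return np.maximum(t - y, 0.0)

def gcv(y_true, y_fit, m_lambda):
    """Generalized cross-validation value of Eq. (4);
    m_lambda is the effective number of parameters M(lambda)."""
    n = len(y_true)
    rss = np.sum((y_true - y_fit) ** 2)
    return rss / (1.0 - m_lambda / n) ** 2

y = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
print(hinge_pos(y, 0.0))   # [0. 0. 0. 1. 2.]
print(hinge_neg(y, 0.0))   # [2. 1. 0. 0. 0.]
```

A perfect fit gives GCV = 0; for a fixed residual sum of squares, larger M(λ) inflates the value, penalizing complexity.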
In Eq. (4), $\hat{f}_{\lambda}$ is the best estimated model with $\lambda$ number of terms, $M(\lambda)$ stands for the effective number of parameters and $N$ represents the number of observations. Here, $M(\lambda)$ can be found via $M(\lambda) = r + cK$, where $r$ denotes the number of linearly independent basis functions and $K$ describes the number of knots selected during the forward stage. Finally, $c$ is the cost for the optimization of the basis function and the smoothing parameter of the model. This value is typically equated to $c = 3$, whereas, if the model is restricted to an additive model, it is set to $c = 2$ [9]. Hence, the model which minimizes GCV is chosen
as the final model in the backward stage.

Conic Multivariate Adaptive Regression Spline
Conic Multivariate Adaptive Regression Spline (CMARS) is a modified version of MARS to overcome the problem of overfitting in MARS [10]. Thus, once a complex model is found in the forward stage of MARS, the penalized residual sum of squares (PRSS) presented in Eq. (5) is computed for the parameter estimation in CMARS. PRSS is described as below [11]:

$$\mathrm{PRSS} = \sum_{i=1}^{N} \big(y_i - f(\bar{x}_i)\big)^2 + \sum_{m=1}^{M_{\max}} \lambda_m \sum_{|\alpha|=1}^{2} \sum_{\substack{r<s \\ r,s \in V(m)}} \int_{Q_m} \theta_m^2 \big[D^{\alpha}_{r,s}\,\psi_m(t_m)\big]^2 \, dt_m. \qquad (5)$$

In this expression, $\lambda_m$ stands for the penalty parameters, $V(m)$ is the variable set associated with the $m$th basis function $\psi_m$, $t_m$ denotes the vector of variables which contribute to the $m$th basis function and thus shows the integration variable of the complexity term, and finally, $Q_m$ denotes a sufficiently large parallelepiped where the integration takes place, i.e., where the input data are located. The optimization problem in Eq. (5), which controls both the accuracy and the complexity of the model, can also be rearranged as a Tikhonov regularization (TR) problem of the following form:

$$\mathrm{PRSS} \approx \lVert y - \psi(\bar{d})\,\theta \rVert_2^2 + \lambda \lVert L\theta \rVert_2^2. \qquad (6)$$
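For a fixed penalty λ, the Tikhonov form in Eq. (6) has the familiar ridge-type closed-form minimizer θ̂ = (ΨᵀΨ + λLᵀL)⁻¹Ψᵀy. CMARS itself treats the problem via conic quadratic programming, so the numpy sketch below, which assumes the simplification L = I and uses synthetic data, is only illustrative:

```python
import numpy as np

def tikhonov_solve(Psi, y, lam, L=None):
    """Closed-form minimizer of ||y - Psi theta||^2 + lam * ||L theta||^2 (cf. Eq. (6))."""
    d = Psi.shape[1]
    if L is None:
        L = np.eye(d)                      # simplifying assumption; the paper uses a diagonal L
    A = Psi.T @ Psi + lam * (L.T @ L)
    return np.linalg.solve(A, Psi.T @ y)

rng = np.random.default_rng(1)
Psi = rng.normal(size=(50, 4))             # hypothetical basis-function matrix
theta_true = np.array([1.0, -2.0, 0.5, 0.0])
y = Psi @ theta_true + 0.01 * rng.normal(size=50)
theta_hat = tikhonov_solve(Psi, y, lam=0.1)
print(np.round(theta_hat, 2))              # close to theta_true for small lam and noise
```

Increasing λ shrinks θ̂ toward zero, trading accuracy against model complexity exactly as Eq. (5) intends.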
In Eq. (6), $L$ implies a diagonal $(M_{\max}+1) \times (M_{\max}+1)$ matrix, $\theta$ refers to an $((M_{\max}+1) \times 1)$-dimensional vector of parameters, and $\lVert \cdot \rVert_2^2$ denotes the squared $L_2$-norm of the given term. In this way, conic quadratic programming can be performed to solve the regularization problem in Eq. (5).

In this study, we construct the loop-based CMARS model similar to the regression model given in Eq. (1), and call it loop-based CMARS, shortly
LCMARS. Thereby, we regress each node in the graph against all the remaining nodes iteratively and estimate the regression coefficients to detect the relation
between the corresponding nodes. In this calculation, the steps of the LCMARS method can be listed as below:
(1) Construct p models which include both main and second-order interaction effects by using the forward stage of MARS.
(2) Compute PRSS for each model without moving to the backward stage of MARS.
(3) Determine the best model, which has the optimal penalty term.
(4) For each selected LCMARS model, assign 1 to the associated entry of the adjacency matrix if there is a relation between the response and explanatory variables.
(5) Apply the AND rule to build a symmetric adjacency matrix. Here, the AND rule means that we assign 1 for each pair of nodes (i, j) in the precision matrix if both the (i, j) and (j, i) entries in Θ indicate edges.
(6) Compare the estimated adjacency matrix with the true binary precision matrix to compute accuracy measures, which we choose as the recall, F-measure, Jaccard index and accuracy.
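Steps (4)-(5) above can be sketched as follows; the example adjacency matrix is hypothetical, and this Python/numpy fragment is an illustration rather than the authors' R implementation:

```python
import numpy as np

def and_rule(adj):
    """AND rule of step (5): keep edge (i, j) only if both the (i, j)
    and (j, i) entries of the estimated adjacency matrix equal 1."""
    adj = np.asarray(adj)
    return (adj * adj.T).astype(int)

raw = np.array([[0, 1, 1],
                [1, 0, 0],
                [0, 0, 0]])
sym = and_rule(raw)
print(sym)  # edge 0-1 survives; the one-directional 0-2 entry is removed
```

Because an edge needs support from both node-wise regressions, the AND rule tends to produce sparser, more conservative networks than an OR rule would.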
3. Application
In the application, we perform the LCMARS, MARS and GGM methods to construct the networks of simulated and real biological datasets. We implement LCMARS by using the main effects and the second-order interactions of the model. The significant regression coefficients of this model are directly taken for the estimation of the adjacency matrix, which is used to indicate the undirected links between nodes in a network. Here, the estimated adjacency matrix can be nonsymmetric since the response variables in each regression model have different significant explanatory variables. So we apply the AND rule, which means that we obtain an entry 1 when both the (i, j) and (j, i) entries of the estimated adjacency matrix are one. By using the AND rule, we infer sparser matrices, similar to the real biological networks. In the end, we compare the symmetric adjacency matrix with the true network and evaluate the different accuracy measures, such as the recall, F-measure, Jaccard index and accuracy.
The mathematical description of each measure is presented in the following equations:

$$\mathrm{Recall} = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}}, \qquad (7)$$

$$\text{F-measure} = \frac{2\,\mathrm{TP}}{2\,\mathrm{TP}+\mathrm{FP}+\mathrm{FN}}, \qquad (8)$$

$$\text{Jaccard index} = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}+\mathrm{FN}}, \qquad (9)$$

$$\mathrm{Accuracy} = \frac{\mathrm{TP}+\mathrm{TN}}{\mathrm{TP}+\mathrm{TN}+\mathrm{FP}+\mathrm{FN}}. \qquad (10)$$
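A minimal sketch of Eqs. (7)-(10) applied to binary adjacency matrices follows; the example matrices are hypothetical:

```python
import numpy as np

def network_scores(est, true):
    """Recall, F-measure, Jaccard index and accuracy of Eqs. (7)-(10),
    computed over the off-diagonal entries of binary adjacency matrices."""
    est, true = np.asarray(est), np.asarray(true)
    mask = ~np.eye(est.shape[0], dtype=bool)       # ignore the diagonal (no self-loops)
    e, t = est[mask], true[mask]
    tp = np.sum((e == 1) & (t == 1))
    tn = np.sum((e == 0) & (t == 0))
    fp = np.sum((e == 1) & (t == 0))
    fn = np.sum((e == 0) & (t == 1))
    return {"recall": tp / (tp + fn),
            "f_measure": 2 * tp / (2 * tp + fp + fn),
            "jaccard": tp / (tp + fp + fn),
            "accuracy": (tp + tn) / (tp + tn + fp + fn)}

true = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]])
est  = np.array([[0, 1, 0], [1, 0, 0], [0, 0, 0]])
scores = network_scores(est, true)
print(scores)
```

In this toy case the 0-1 edge is recovered and the 0-2 edge is missed, so recall and the Jaccard index are both 0.5.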
In these expressions, the true positive (TP) implies the number of correctly classified objects that have positive labels, the true negative (TN) indicates the number of correctly classified objects that have negative labels, the false positive (FP) shows the number of misclassified objects that have negative labels, and the false negative (FN) presents the number of misclassified objects that have positive labels. Here, recall represents the ratio of correctly estimated positive links (TP) over the total number of links in the true network. In other words, it deals only with the number of TP and does not pay attention to the number of TN. On the contrary, the accuracy is the ratio of correctly classified genes in both labels to the total of all classified genes, so it evaluates the performance of the methods from both sides. Additionally, in all these measures, 1 indicates perfect accuracy, whereas zero implies no success at all.

3.1. Application of Simulated Study

The simulated datasets are created under mainly two different scenarios such that the generated matrices come from normal and non-normal distributions. The multivariate normally distributed datasets are generated under the scale-free feature by using the huge package in the R programming language. For non-normal datasets, we use different distributions such as the Student-t, log-normal, Cauchy, and a mixture of the log-normal and normal distributions. For the Student-t distribution, we use two different degrees of freedom (df), namely, df = 7 and df = 15, so that we can detect the results of the Student-t and the results of its more normally distributed version, respectively. The outputs of the LCMARS, MARS and GGM methods under all scenarios are presented in Tables 1-5.

Furthermore, for all schemes, the datasets are simulated under various dimensions, p, listed as 40, 100 and 200 genes in the system. Moreover, the number of observations for each dataset equals 20 and all the results are based on 1000 Monte Carlo runs. Finally, in all scenarios, the topology of the networks is chosen as scale-free, as this is the most common view of the biological networks [12].
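The scale-free topologies themselves come from the R package huge; purely as an illustration of the preferential-attachment mechanism that underlies such topologies, a small Python sketch is given below (all parameters are arbitrary, and this is not the generator used in the paper):

```python
import random

def barabasi_albert(p, m=1, seed=42):
    """Preferential-attachment sketch: each new node attaches to m existing
    nodes with probability proportional to their current degree."""
    random.seed(seed)
    edges = [(0, 1)]                       # start from a connected pair
    targets = [0, 1]                       # degree-weighted urn of endpoints
    for new in range(2, p):
        chosen = set()
        while len(chosen) < min(m, new):
            chosen.add(random.choice(targets))
        for t in chosen:
            edges.append((new, t))
            targets.extend([new, t])       # both endpoints gain degree
    return edges

edges = barabasi_albert(40)
print(len(edges))   # p - 1 edges when m = 1 (a scale-free tree)
```

Hubs emerge because high-degree nodes appear more often in the urn, matching the heavy-tailed degree distributions observed in biological networks [12].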
On the other hand, we perform the graphical lasso, also called glasso, method [9] to infer the model parameters in GGM. This method basically finds the entries of the precision matrix via the penalized likelihood function whose penalty constant is controlled by the $L_1$-norm. In this study, the optimal penalty constant is selected by the StARS criterion [13].
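For reference, the penalized likelihood that glasso maximizes can be written explicitly (a standard formula from [9]; $S$ is the sample covariance matrix, $\rho$ the penalty constant controlling the $L_1$-norm, and the optimization runs over positive definite precision matrices $\Theta$):

$$\hat{\Theta} = \arg\max_{\Theta \succ 0} \Big\{ \log\det\Theta - \mathrm{tr}(S\Theta) - \rho\,\lVert \Theta \rVert_1 \Big\}.$$

The nonzero off-diagonal entries of $\hat{\Theta}$ then define the estimated edges of the network.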
From the analyses, we observe that for the normal distribution, the recall and accuracy values of both methods are close to each other under all dimensions. On the other side, the F-measure and Jaccard index values of LCMARS are greater than those of GGM. Additionally, as seen in Table 1, when the dimension of the data increases, the differences between the methods get larger.

Additionally, as observed in Table 2, under the Student-t distribution for all dimensions and both degrees of freedom (df = 7 and df = 15), the F-measure, Jaccard index and accuracy values of LCMARS are higher than those of both MARS and GGM. Similar to the previous result, GGM gets worse when the dimension increases.

In Table 3, we observe that the accuracy, F-measure and Jaccard index values of LCMARS under the log-normal datasets are a bit higher than those of GGM when the dimension of the network is small, i.e., p = 40, whereas the difference between LCMARS and GGM grows significantly for larger dimensions. Besides these, the recall values of GGM are higher than those of LCMARS for high dimensional systems. Moreover, the accuracy, F-measure and Jaccard index values of MARS
Table 1: Comparison of the recall, F-measure, Jaccard index and accuracy for normally distributed data. p is the total number of genes in the system. The best models are shown in boldface.

Method   p    Recall   F-measure   Jaccard index   Accuracy
LCMARS   40   0.3483   0.4608      0.2995          0.9400
LCMARS   100  0.3392   0.4492      0.2896          0.9752
LCMARS   200  0.3359   0.4455      0.2869          0.9876
GGM      40   0.3539   0.4335      0.2773          0.9309
GGM      100  0.3462   0.3773      0.2330          0.9655
GGM      200  0.3426   0.3264      0.1958          0.9782
MARS     40   0.3502   0.4505      0.2910          0.9369
MARS     100  0.3388   0.4556      0.2949          0.9759
MARS     200  0.3571   0.1990      0.1105          0.9570
are close to those of LCMARS under small and moderate systems, i.e., p = 40 and 100, respectively. On the other side, the difference between LCMARS and MARS gets larger under bigger systems, specifically when we compare the F-measure and Jaccard index.

Finally, the results for the Cauchy distribution and the mixture of the log-normal and normal distributions are similar to the previous results, as can be seen in Table 4 and Table 5, respectively. In other words, LCMARS produces higher scores than MARS and GGM in terms of the accuracy, F-measure and Jaccard index. Furthermore, although the recall values of GGM are higher than those of LCMARS and MARS, the differences get smaller when the dimension increases.

Hence, from all findings, we detect that apart from small systems, LCMARS performs better than both MARS and GGM under all scenarios and dimensions. Additionally, when we compare the results of LCMARS and MARS, it can be observed that although the results are close to each other under small networks, LCMARS stands out from MARS for large networks. Therefore, we consider that the suggested approach can be a promising alternative to GGM in the
Table 2: Comparison of the recall, F-measure, Jaccard index and accuracy for Student-t distributed data with degrees of freedom (df) 7 and 15. p is the total number of genes in the system. The best models are shown in boldface.

df   Method   p    Recall   F-measure   Jaccard index   Accuracy
7    LCMARS   40   0.3379   0.4468      0.2882          0.9374
7    LCMARS   100  0.3346   0.4425      0.2844          0.9747
7    LCMARS   200  0.3339   0.4426      0.2846          0.9873
7    GGM      40   0.3895   0.3117      0.1853          0.8676
7    GGM      100  0.3780   0.2064      0.1155          0.9090
7    GGM      200  0.3744   0.1335      0.0718          0.9238
7    MARS     40   0.3422   0.4358      0.2790          0.9335
7    MARS     100  0.3349   0.4403      0.2826          0.9745
7    MARS     200  0.3540   0.1996      0.1109          0.9574
15   LCMARS   40   0.3382   0.4468      0.2883          0.9374
15   LCMARS   100  0.3347   0.4425      0.2844          0.9747
15   LCMARS   200  0.3341   0.4428      0.2846          0.9873
15   GGM      40   0.3972   0.2971      0.1749          0.8564
15   GGM      100  0.3833   0.1909      0.1058          0.8998
15   GGM      200  0.3792   0.1249      0.0668          0.9179
15   MARS     40   0.3415   0.4353      0.2786          0.9335
15   MARS     100  0.3351   0.4409      0.2830          0.9745
15   MARS     200  0.3544   0.1997      0.1109          0.9574
construction of complex biological networks. In Fig. 4, we show the number of successes for each model based on each accuracy measure. The plot also indicates that LCMARS is more accurate than the other models for all selected measures.

3.2. Application of Biological Data

To evaluate the performance of the methods under biological data, we use different synthetic and real benchmark datasets. In this application, the synthetic
Table 3: Comparison of the recall, F-measure, Jaccard index and accuracy for log-normally distributed data. p is the total number of genes in the system. The best models are shown in boldface.

Method   p    Recall   F-measure   Jaccard index   Accuracy
LCMARS   40   0.3559   0.4615      0.3000          0.9388
LCMARS   100  0.3391   0.4534      0.2931          0.9756
LCMARS   200  0.3357   0.4565      0.2961          0.9880
GGM      40   0.3390   0.5063      0.3390          0.9512
GGM      100  0.4664   0.1198      0.0637          0.7950
GGM      200  0.4215   0.0851      0.0444          0.8643
MARS     40   0.3390   0.4600      0.2990          0.9410
MARS     100  0.3397   0.4424      0.2840          0.9745
MARS     200  0.3586   0.1918      0.1061          0.9548
Table 4: Comparison of the recall, F-measure, Jaccard index and accuracy for Cauchy distributed data. p is the total number of genes in the system. The best models are shown in boldface.

Method   p    Recall   F-measure   Jaccard index   Accuracy
LCMARS   40   0.3487   0.4563      0.2958          0.9388
LCMARS   100  0.3385   0.4588      0.2976          0.9762
LCMARS   200  0.3351   0.4694      0.3070          0.9889
GGM      40   0.7413   0.1585      0.0861          0.4191
GGM      100  0.5532   0.0906      0.0475          0.6689
GGM      200  0.4468   0.0708      0.0367          0.8243
MARS     40   0.3552   0.4295      0.2737          0.9303
MARS     100  0.3395   0.4422      0.2839          0.9745
MARS     200  0.3570   0.2038      0.1135          0.9583
Table 5: Comparison of the recall, F-measure, Jaccard index and accuracy for the mixture of log-normal and normal data. p is the total number of genes in the system. The best models are shown in boldface.

Method   p    Recall   F-measure   Jaccard index   Accuracy
LCMARS   40   0.3471   0.4621      0.3005          0.9405
LCMARS   100  0.3388   0.4586      0.2974          0.9762
LCMARS   200  0.3355   0.4593      0.2984          0.9881
GGM      40   0.4593   0.2505      0.1435          0.7963
GGM      100  0.4112   0.1631      0.0888          0.8739
GGM      200  0.3816   0.1264      0.0675          0.9211
MARS     40   0.3548   0.4322      0.2759          0.9312
MARS     100  0.3403   0.4370      0.2796          0.9739
MARS     200  0.3564   0.2047      0.1140          0.9586

data are taken from benchmark synthetic data tools such as SynTReN (Synthetic
transcriptional regulatory networks) [14] and GeneNetWeaver (GNW) [15]. The synthetic gene expression datasets from these tools approximate the experimental data and enable us to validate different network construction methods, as they also present the true network model.
Accordingly, in this study, the gene expression datasets from DREAM 4 (2009) in-silico size 10 and size 100 are used. DREAM is the international Dialogue for Reverse Engineering Assessments and Methods competition, which publishes network inference challenges. Hereby, the DREAM 4 in-silico size 10 challenge consists of five networks involving 10 genes. Moreover, the in-silico size 100 challenge includes five networks involving 100 genes. For each network, multifactorial data are available which contain steady-state levels of variations in the network based on the gene expression characteristics of two well-studied systems, namely, E. coli and S. cerevisiae. These datasets are accessible from the R package DREAM4. Hence, in our analyses, we use three datasets from in-silico size 10 and two datasets from in-silico size 100. The results of the in-silico size 10 and size 100 networks are presented in Table 6 and Table 7, respectively. From the findings, it is observed that LCMARS has better scores than MARS and GGM in terms of the recall, F-measure and Jaccard index for two of the in-silico size 10 datasets. Moreover, the accuracy values of MARS are higher than those of the other methods for these two datasets. Additionally, for the third dataset, LCMARS is better under all accuracy measures. On the other hand, for the in-silico size 100 datasets, the accuracy results of the LCMARS and MARS methods are very close to each other. However, the GGM method is not applicable since this method cannot work when the system's elements have high correlation.
Table 6: Comparison of the recall, F-measure, Jaccard index and accuracy for DREAM-4 in-silico size 10. The best models are shown in boldface.

Data   Method   Recall   F-measure   Jaccard index   Accuracy
1      LCMARS   0.647    0.647       0.478           0.760
1      MARS     0.471    0.616       0.444           0.800
1      GGM      0.294    0.454       0.294           0.760
2      LCMARS   0.474    0.563       0.391           0.720
2      MARS     0.368    0.538       0.368           0.760
2      GGM      0.263    0.416       0.263           0.720
3      LCMARS   0.438    0.583       0.412           0.800
3      MARS     0.375    0.522       0.353           0.780
3      GGM      0.312    0.476       0.312           0.780
Furthermore, we compare the performance of the three methods with a simulated toy example of gene expression data generated by the GNW generator by using the known biological interaction networks of E. coli. The toy data contain 64 samples and 64 genes and are available in the R package grndata. We present the outcomes of this dataset in Table 8. From the outcomes, we observe that the recall, F-measure and Jaccard index of LCMARS are better than those of the others. On the other side, the accuracy value of GGM is slightly higher than that of LCMARS and it is
very close to the MARS result.

Table 7: Comparison of the recall, F-measure, Jaccard index and accuracy for DREAM-4 in-silico size 100. The best models are shown in boldface.

Data   Method   Recall   F-measure   Jaccard index   Accuracy
1      LCMARS   0.275    0.296       0.174           0.943
1      MARS     0.248    0.298       0.175           0.934
2      LCMARS   0.234    0.267       0.154           0.939
2      MARS     0.207    0.258       0.148           0.949
Table 8: Comparison of the recall, F-measure, Jaccard index and accuracy for the toy data and E-GEOD-9891. The best models are shown in boldface.

Data          Method   Recall   F-measure   Jaccard index   Accuracy
Toy           LCMARS   0.360    0.347       0.210           0.842
Toy           MARS     0.234    0.342       0.207           0.895
Toy           GGM      0.134    0.236       0.134           0.899
E-GEOD-9891   LCMARS   0.703    0.825       0.703           0.703
E-GEOD-9891   MARS     0.306    0.469       0.306           0.306
E-GEOD-9891   GGM      0.091    0.167       0.091           0.091
Finally, as real biological data, we apply the transcription profiling of 285 human ovarian tumour samples, named E-GEOD-9891. The data are cohorts of 285 patients with epithelial ovarian, primary peritoneal, or fallopian tube cancer, diagnosed between 1992 and 2006. The samples are collected through the Australian Ovarian Cancer Study (n=206), Royal Brisbane Hospital (n=22), Westmead Hospital (n=54) and the Netherlands Cancer Institute (n=3) [16]. In this dataset, there are 11 target genes which are chosen to identify the novel molecular subtypes of ovarian cancer by gene expression profiling with a linkage to clinical and pathological features [16]. According to the results in Table 8, it is seen that LCMARS performs significantly better than
Figure 3: (a) True network of the E-GEOD-9891 data, (b) estimated network via LCMARS, (c) estimated network via MARS and (d) estimated network via GGM.

Figure 4: Bar charts of the number of successful models for simulated (dark blocks) and real (light blocks) datasets based on recall, accuracy, F-measure and Jaccard index.
other methods in terms of all measures. Thereby, as shown in Fig. 3, LCMARS can detect 37 links out of the 55 links of the true network, whereas MARS can only capture 13 links of the same 55 links. On the other side, GGM cannot detect any links, i.e., edges between genes. We summarize the performance of all models in Fig. 4, similar to the simulated data analyses, in order to count the success of each model.

4. Conclusion
In this study, we have proposed CMARS, a nonparametric regression model, as an alternative to GGM for constructing biological networks, because from previous studies [2] we have observed that the performance of MARS, which has also been suggested as a competitor of GGM and is the basic modelling approach behind CMARS, is better than that of GGM in terms of accuracy and computational time. Here, we have implemented the loop-based CMARS, called LCMARS, by including main and second-order interaction effects in the regression model. To compare the performance of LCMARS with MARS and GGM, we have conducted simulation studies under distinct distributions and dimensions. Also, we have applied different synthetic and real biological benchmark datasets. From the comparative analyses, we have seen that LCMARS has higher accuracy than both MARS and GGM under both simulated and real datasets with non-normal and normal distributions. Furthermore, in real data, LCMARS stands out from the MARS and GGM methods. Therefore, we have concluded that LCMARS can be a strong alternative to GGM, since LCMARS overcomes the limitations of GGM such as the strict normality assumption and low accuracy under large systems.

As future work of the study, we consider adapting this approach to the robust version of CMARS, the so-called RCMARS [17]. The RCMARS model is shortly suggested to detect the model which has the lowest false positive rate [18], resulting in the minimum number of links in the system. We think that this new model can be converted into a lasso-type model, similar to LCMARS, in order to construct the major core subnetworks. Furthermore, we consider that the description of the spline functions in the MARS, CMARS and RCMARS models can be extended by covering not only the linear relation, but also the non-linear relationship between genes. For this purpose, we think of performing the P-splines approach [19], which has unspecified smoothing functions in modelling.

Acknowledgement

The authors thank the BAP project (no: BAP-01-09-2016-002) at Middle East Technical University for its support. They also thank Prof. Dr. Gerhard Wilhelm Weber for his valuable discussion in different stages of the study. Finally, they thank the anonymous referee and editor for their valuable comments which improved the quality of the paper.
References

[1] J. Whittaker, Graphical Models in Applied Multivariate Statistics, John Wiley and Sons, New York, 1990.
[2] E. Ayyıldız, M. Ağraz, V. Purutçuoğlu, MARS as an alternative approach of Gaussian graphical model for biochemical networks, Journal of Applied Statistics (2016) 1-19. doi:10.1080/02664763.2016.1266464.
[3] A. L. Barabasi, Z. N. Oltvai, Network biology: understanding the cell's functional organization, Nature Reviews Genetics 5 (2004) 101-113. doi:10.1038/nrg1272.
[4] B. H. Junker, F. Schreiber, Analysis of Biological Networks, John Wiley and Sons, Inc., Hoboken, New Jersey, 2008.
[5] R. Tibshirani, Regression shrinkage and selection via the lasso, Journal of the Royal Statistical Society 58 (1) (1996) 267-288.
[6] J. H. Friedman, Multivariate adaptive regression splines, The Annals of Statistics 19 (1) (1991) 1-67.
[7] T. Hastie, R. Tibshirani, J. H. Friedman, The Elements of Statistical Learning: Data Mining, Inference and Prediction, Springer Verlag, New York, 2009.
[8] A. R. Barron, X. Xiao, Discussion: multivariate adaptive regression splines, The Annals of Statistics 19 (1) (1991) 67-82. doi:10.1214/aos/1176347964.
[9] J. H. Friedman, T. Hastie, R. Tibshirani, Sparse inverse covariance estimation with the graphical lasso, Biostatistics 9 (3) (2008) 432-441. doi:10.1093/biostatistics/kxm045.
[10] F. Yerlikaya, İ. Batmaz, G. W. Weber, A review and new contribution on conic multivariate adaptive regression splines (CMARS): a powerful tool for predictive data mining, in: Modeling, Dynamics, Optimization and Bioeconomics, Vol. 37, Springer International Publishing, 2014, pp. 695-722.
[11] P. Taylan, G. W. Weber, F. Yerlikaya, A new approach to multivariate adaptive regression splines by using Tikhonov regularization and continuous optimization, TOP 18 (2) (2010) 377-395. doi:10.1007/s11750-010-0155-7.
[12] U. Alon, An Introduction to Systems Biology: Design Principles of Biological Circuits, Chapman and Hall/CRC, 2007.
[13] H. Liu, K. Roeder, L. Wasserman, Stability approach to regularization selection (StARS) for high dimensional graphical models, Advances in Neural Information Processing Systems 23 (2010) 1432-1440.
[14] T. V. den Bulcke, K. V. Leemput, B. Naudts, P. van Remortel, H. Ma, A. Verschoren, B. D. Moor, K. Marchal, SynTReN: a generator of synthetic gene expression data for design and analysis of structure learning algorithms, BMC Bioinformatics 7 (43) (2006) 1-12. doi:10.1186/1471-2105-7-43.
[15] T. Schaffter, D. Marbach, D. Floreano, GeneNetWeaver: in silico benchmark generation and performance profiling of network inference methods, Bioinformatics 27 (16) (2011) 2263-2270. doi:10.1093/bioinformatics/btr373.
[16] R. Tothill, A. Tinker, J. George, R. Brown, S. Fox, D. Johnson, M. Trivett, D. Etemadmogadham, B. Locandro, N. Traficante, S. Fereday, J. Hung, Y. Chiew, I. Haviv, D. Gertig, A. deFazio, D. Bowtell, Novel molecular subtypes of serous and endometrioid ovarian cancer linked to clinical outcome, Clinical Cancer Research 14 (16) (2008) 5198-5208. doi:10.1158/1078-0432.Ccr-08-0196.
[17] A. Özmen, G. W. Weber, İ. Batmaz, The new robust CMARS (RCMARS) method, in: ISI Proceedings of 24th MEC-EurOPT 2010 - Continuous Optimization and Informatics-Based Technologies in the Financial Sector, 2010.
[18] A. Özmen, G. W. Weber, İ. Batmaz, E. Kropat, RCMARS: robustification of CMARS with different scenarios under polyhedral uncertainty set, Communications in Nonlinear Science and Numerical Simulation 16 (12) (2011) 4780-4787. doi:10.1016/j.cnsns.2011.04.001.
[19] S. Lang, A. Brezger, Bayesian P-splines, Journal of Computational and Graphical Statistics 13 (1) (2004) 183-212. doi:10.1198/1061860043010.
Highlights

- A new alternative model for complex biological systems is proposed.
- A loop-based conic multivariate adaptive regression splines model is developed for the construction of biological networks.
- The accuracy was evaluated under various simulated and real networks.
Autobiography

Vilda Purutçuoğlu is an associate professor in the Department of Statistics at Middle East Technical University (METU) and also an affiliated faculty member in the Informatics Institute, the Institute of Applied Mathematics and the Department of Biomedical Engineering at METU. She completed her BSc and MSc in statistics with a minor degree in economics, and received her doctorate at Lancaster University. Purutçuoğlu's current research is in the fields of bioinformatics, systems biology and biostatistics. She leads a research group of doctoral and master's students who have been working on deterministic and stochastic modeling of biological networks and their inference via Bayesian and frequentist theories.

Ezgi Ayyıldız is a doctoral student in the Department of Statistics at Middle East Technical University (METU). She completed her master's and undergraduate programmes in the same department at METU. In her doctoral thesis, she has been working on different models and the inference of biological networks under the supervision of Dr. Purutçuoğlu.