6th IFAC Conference on Foundations of Systems Biology in Engineering 6th IFAC Conference on 6th IFAC 9-12, Conference on Available online at www.sciencedirect.com October 2016. Magdeburg, Germany Foundations of Systems Foundations of Systems Biology Biology in in Engineering Engineering October October 9-12, 9-12, 2016. 2016. Magdeburg, Magdeburg, Germany Germany
ScienceDirect
IFAC-PapersOnLine 49-26 (2016) 147–152
Gene Regulatory Network Inference Using Time-Stamped Cross-Sectional Gene Inference Using Cross-Sectional Gene Regulatory Regulatory Network Network Inference Using Time-Stamped Time-Stamped Cross-Sectional Single Cell Expression Data Single Cell Cell, Expression Expression Data Data , Single Nan Papili Gao* ** S. M. Minhaz Ud-Dean* ** , ,1 , ,, M. Ud-Dean* ** Nan Papili Gao* ** S. S.Gunawan* M. Minhaz Minhaz,** Ud-Dean* ** Nan PapiliRudiyanto Gao*,** ,1 , ,1 Rudiyanto ** Rudiyanto Gunawan* Gunawan* **
*Institute for Chemical and Bioengineering, ETH Zurich, Zurich, Switzerland *Institute for Chemical and Bioengineering, Zurich, Zurich, Institute Bioinformatics,ETH Lausanne, *Institute** forSwiss Chemical andof Bioengineering, ETH Zurich,Switzerland Zurich, Switzerland Switzerland ** Swiss Swiss Institute Institute of of Bioinformatics, Bioinformatics, Lausanne, Lausanne, Switzerland Switzerland **
Abstract: In this paper we presented a novel method for inferring gene regulatory network (GRN) from time-stamped cross-sectional single cell data. method Our strategy, called SNIFS (Sparse Network For Abstract: In this this paper aa novel for gene network (GRN) Abstract: In paper we we presented presented novel method for inferring inferring gene regulatory regulatory networkInference (GRN) from from Single cell data) seeks to recover the causal relationships among genes by analyzing the evolution of the time-stamped cross-sectional single cell data. Our strategy, called SNIFS (Sparse Network Inference For time-stamped cross-sectional single cell data. Our strategy, called SNIFS (Sparse Network Inference For distribution of gene expression levels over time, more specifically using Kolmogorov-Smirnov (KS) Single cell data) seeks to recover the causal relationships among genes by analyzing the evolution of the Single cell data) seeks to recover the causal relationships among genes by analyzing the evolution of the distance. In of thegene proposed method, we over formulated the GRN inference as aKolmogorov-Smirnov linear regression problem, distribution expression levels time, specifically using (KS) distribution of gene expression levels over time, more more specifically using Kolmogorov-Smirnov (KS) where we used Lasso regularization to obtain the optimal sparse solution. We tested SNIFS using in distance. In the proposed method, we formulated the GRN inference as a linear regression problem, distance. In the proposed method, we formulated the GRN inference as a linear regression problem, silico single cell data from 10and 20-gene GRNs, and compared the performance of our method with where we used Lasso regularization to obtain the optimal sparse solution. We tested SNIFS using in where we used Lasso regularization to obtain the optimal sparse solution. We tested SNIFS using in Time Series GEneGRNs, Network of trees (GENIE3), and silico single cell from and and compared the performance of method silico single Network cell data data Inference from 1010- (TSNI), and 20-gene 20-gene GRNs, andInference comparedwith the Ensemble performance of our our method with with an extension of GENIE3 for time series data called JUMP3. The results showed that SNIFS Time Time Series Series Network Network Inference Inference (TSNI), (TSNI), GEne GEne Network Network Inference Inference with with Ensemble Ensemble of of trees trees (GENIE3), (GENIE3), and and outperformed existing algorithms based on data the Area the Receiver Operating an extension GENIE3 for series called JUMP3. The showed that an extension of of GENIE3 for time time series data calledUnder JUMP3. The results results showedCharacteristic that SNIFS SNIFS (AUROC) and existing Area Under the Precision-Recall (AUPR) outperformed algorithms based Area Under outperformed existing algorithms based on on the the Area curves. Under the the Receiver Receiver Operating Operating Characteristic Characteristic (AUROC) and Area Under the Precision-Recall (AUPR) curves. (AUROC) and Area Under the Precision-Recall (AUPR) curves. © 2016, IFAC (International Federation of Automatic Control)gene Hosting by Elsevier Ltd. All rights reserved. Keywords: network inference, single cell, gene expression, regulatory network Keywords: network inference, single cell, gene expression, gene regulatory network Keywords: network inference, single cell, gene expression, gene regulatory network 1. INTRODUCTION 1. INTRODUCTION 1. INTRODUCTION What is the gene regulatory network underlying the stochastic cell fate decision-making process? What is dynamics the gene gene inregulatory regulatory network underlying underlying the What is the network the How can we control such a process to give a desired stochastic dynamics in cell fate decision-making process? stochastic dynamics in cell fate decision-making process? outcome? answers such to these questions have aaimportant How can can The we control control such process to give give desired How we aa process to desired implications in understanding and manipulating the cell outcome? The outcome? The answers answers to to these these questions questions have have important important differentiation and moreand broadly, in providing implications in inprocess, understanding and manipulating the new cell implications understanding manipulating the cell appreciation for the role of cell-to-cell variability in the differentiation process, and more broadly, in providing new differentiation process, and more broadly, in providingfields new of biology and 2013). Thanks to recent appreciation formedicine the role role of of(Sandberg cell-to-cell variability in the the fields appreciation for the cell-to-cell variability in fields advances in high-throughput cell profiling technologies, such of biology and medicine (Sandberg 2013). Thanks to recent of biology and medicine (Sandberg 2013). Thanks to recent as RNA-sequencing and real-time PCR (polymerase chain advances in high-throughput cell profiling technologies, such advances in high-throughput cell profiling technologies, such reaction), and in microfluidic devices for single cell handling, as RNA-sequencing and real-time PCR (polymerase chain as RNA-sequencing and real-time PCR (polymerase chain researchers now have the ability to obtain the transcriptional reaction), and in microfluidic microfluidic devices for single single cell handling, handling, reaction), and in devices for cell expression profile of a large set of genes for single cells. For researchers to obtain the transcriptional researchers now now have have the the ability ability to obtain the transcriptional ® allows the measurements of For the example, Biomark expression profile of of large set set of genes genes for single single cells. cells. For expressionFluidigm profile aa large of for ® ® allows expression levels of 96 genes in 96 single cells in a single the measurements of example, Fluidigm Biomark the example, Fluidigm Biomark allows the measurements of the assay (Pieprzyk High 2009). device expression levels and of 96 96 genes in 96 96This single cells and in aaseveral single expression levels of genes in single cells in single other celland assay platforms offer device much and promise in assay single (Pieprzyk High 2009). several assay (Pieprzyk and High 2009). This This device and several overcoming the limitations of population-averaged other single cell assay platforms offer much promise other single cell assay platforms offer much promise in in measurements. In particular, averages lack the overcoming limitations of population-averaged population-averaged overcoming the the limitationspopulation of necessary information for understanding measurements. In population averages lack measurements. In particular, particular, populationbiological averagesprocesses, lack the the such as cell-fate determination, where cell-to-cell necessary processes, necessary information information for for understanding understanding biological biological variability processes, and play where important functional roles such as determination, where cell-to-cell variability such stochastic as cell-fate cell-fate dynamics determination, cell-to-cell variability (Stegle et al. 2015). and stochastic dynamics play important functional roles and stochastic dynamics play important functional roles (Stegle et al. 2015). (Stegle et al. 2015). 1 1 1
Corresponding author:
[email protected] Corresponding author: Corresponding author:
[email protected] [email protected]
Many of the existing technologies for quantifying single cell ® , rely on cell gene expression, such technologies as the Fluidigm Many of the the existing existing for quantifying single Many of technologies for Biomark quantifying single cell ® ® lysate assay, and therefore would not allow for a longitudinal , rely gene expression, such as the Fluidigm Biomark on cell cell gene expression, such as the Fluidigm Biomark , rely on single cell study. analysis of single cell data new lysate and therefore would not for longitudinal lysate assay, assay, and The therefore would not allow allow for aarequires longitudinal computational approaches, as many of the strategies single cell cell study. study. The analysis analysis of single single cell existing data requires requires new single The of cell data new do not explicitly account for and therefore are unable to take computational computational approaches, approaches, as as many many of of the the existing existing strategies strategies advantage of information contained cell-to-cell do not not explicitly explicitly account for for and and thereforeinare arethe unable to take take do account therefore unable to variability. For this reason, a plethora of new computational advantage of information contained in the cell-to-cell advantage of information contained in the cell-to-cell approaches For havethis been introduced recently for the analysis of variability. For this reason, plethora of new new computational variability. reason, aa plethora of computational cross-sectional single cell data (Woodhouse et al. 2016). of In approaches have been introduced recently for the approaches have been introduced recently for the analysis analysis of particular, several methods focused on the reconstruction of cross-sectional single cell data (Woodhouse et al. 2016). cross-sectional single cell data (Woodhouse et al. 2016). In In the GRN using single cell transcriptomics on Boolean particular, several methods focused on on the thebased reconstruction of particular, several methods focused reconstruction of networks et al. 2014; Moignard et al.based 2015),on the GRN GRN (Chen using single single cell transcriptomics transcriptomics based onstochastic Boolean the using cell Boolean modelling (Teles et al. 2013), gene co-expression/correlation networks (Chen (Chen et et al. al. 2014; 2014; Moignard Moignard et et al. al. 2015), stochastic networks 2015), stochastic (Kouno et (Teles al. 2013; Moignard et al.co-expression/correlation 2013; Pina et al. 2015), modelling (Teles et al. al. 2013), gene gene co-expression/correlation modelling et 2013), and nonlinear ODE systems (Ocone al. et 2015). The (Kouno et al. 2013; Moignard et al. 2013;et Pina al. (Kouno et al. 2013; Moignard et al. 2013; Pina et al. 2015), 2015), applications of these methods however faced a few and nonlinear ODE systems (Ocone et al. 2015). The and nonlinear ODE systems (Ocone et al. 2015). The challenges due to, for example, the requirement of dense time applications of these methods however faced a few applications of these methods however faced a few course data and high computational complexity that scales challenges due to, for example, the requirement of dense time challenges due to, for example, the requirement of dense time exponentially withhigh the size of the network. course data data and and high computational complexity that that scales scales course computational complexity exponentially with with the the size size of of the the network. network. exponentially In this work, we proposed a novel method called SNIFS (Sparse Network Single method cell data)called for efficient In this this work, work, weInference proposedFor novel method called SNIFS In we proposed aa novel SNIFS inference of GRN from time-stamped cross-sectional single (Sparse Network Network Inference Inference For For Single Single cell cell data) data) for for efficient efficient (Sparse cell expression data. SNIFS begins with the computation of inference of GRN from time-stamped cross-sectional single inference of GRN from time-stamped cross-sectional single changes in the transcriptional expression distribution over cell expression data. SNIFS begins with the computation of cell expression data. SNIFS begins with the computation of time for each gene, quantified using the Kolmogorovchanges in the transcriptional expression distribution over changes in the transcriptional expression distribution over time for for each each gene, gene, quantified quantified using using the the KolmogorovKolmogorovtime
2405-8963 © IFAC (International Federation of Automatic Control) Copyright © 2016, 2016 IFAC 1 Hosting by Elsevier Ltd. All rights reserved. Peer review under responsibility of International Federation of Automatic Control. Copyright © 1 Copyright © 2016 2016 IFAC IFAC 1 10.1016/j.ifacol.2016.12.117
2016 IFAC FOSBE 148 October 9-12, 2016. Magdeburg, GermanyNan Papili Gao et al. / IFAC-PapersOnLine 49-26 (2016) 147–152
data are only available at non-uniform time intervals, one could use interpolation algorithms (e.g. spline fitting) to generate artificial KS distances with uniform time meshes. The basic premise of the GRN inference in SNIFS is that the KS distance of a gene at a given time step Δtk is proportional to the KS distances of its regulators from the previous time step Δtk-1, where the proportionality coefficients indicate the strength of the gene regulations. Following this ansatz, we formulated the GRN inference as a multivariate linear regression problem, as follows: (see also Fig. 1C).
Smirnov (KS) distribution distances between two subsequent time points. The GRN inference involves finding a sparse linear model that describes the KS distance of a gene in a given time step as a function of the KS distances of all other genes in the previous time step. We adopted the Lasso regularization to find the optimal sparse solution (Tibshirani 1996). We tested the performance of SNIFS to predict random subnetworks of E. coli and yeast GRN from in silico crosssectional single cell transcriptional expression data. We compared the performance of SNIFS to that of a network inference method for time series (population-averaged) expression data TSNI (Time Series Network Inference) (Bansal et al. 2006), to GENIE3 (GEne Network Inference with Ensemble of trees) that uses a tree-based ensemble regression method (Huynh-Thu et al. 2010), and to JUMP3, a hybrid network inference algorithm that combines decisiontrees with dynamical modeling for time series data (HuynhThu and Sanguinetti 2015). We showed that by employing for the time evolution of the gene expression distributions, SNIFS significantly outperformed TSNI, GENIE3 and JUMP3.
= KS1,Δt α 1, j + KS2,Δt α 2, j +…+ KSm,Δt α m, j (2) KS k−1 k−1 k−1 !! j ,Δtk
where the regression coefficient αi,j describe the influence of gene i on gene j. For each gene, we thus solve a linear regression problem of the form: y=Xα , where y denotes the n−2 vector of KS distances corresponding to time steps from Δt2 to Δtn−1, and X denotes the (n − 2) × m matrix of KS distances corresponding to time steps Δt1 to Δtn−2, for all genes. The regression coefficients contained in the !!m × 1 vector α is further assumed to be sparse, that is each gene has only a small number of regulators (Chen et al. 1999).
2. METHODS
As indicated in Fig. 1D, we employed the ‘least absolute shrinkage and selection operator’ (Lasso) regularization to obtain a sparse solution for α . Specifically, for the inference problem y=Xα , we performed a penalized least square optimization as follows:
2.1 GRN inference from single cell data using SNIFS In this subsection, we describe the method SNIFS. The subroutines of SNIFS are implemented in MATLAB (version 8.6; MathWorks Inc., Natick, MA, 2000) and available upon request.
2
min Y − Xα 2 + λ α !! α
The cross-sectional dataset comprises n data matrices Ds×m ,
!
where Di,j is the transcriptional expression value of gene j in the i-th cell. The first step in SNIFS is the calculation of distribution distances of gene expression levels between time points (see Fig. 1B). In particular, we used the KolmogorovSmirnov distance (Massey 1951) to measure the changes in the gene expression distribution over time, as follows:
(
)
such that! α ≥ 0
(3)
where the scalar weight λ is data dependent and requires parameter tuning (Tibshirani 1996). Here, we used the MATLAB version of GLMNET algorithm to compute the Lasso regularization path, providing the vectors α for different λ values. GLMNET utilizes a cyclical coordinate descent method (Friedman et al. 2010), where the optimization above is iteratively performed one parameter at a time. In order to find the optimal weight λ, we implemented a leave-one-out cross validation (LOOCV). Briefly, in each trial, we generated the Lasso regularization path using the dataset excluding one sample that was assigned as the test set. Subsequently, we computed the mean squared error in predicting the test sample using α vectors in this regularization path as a function of λ. Finally, we chose the optimal λ value corresponding to either the minimum average prediction error λMIN or the minimum average prediction error plus a standard error λSE. The final α * vector was taken from the regularization path generated using the entire dataset at the selected optimal λ parameter.
Fig. 1 illustrates the GRN inference procedure using SNIFS. The input data for SNIFS (see Fig. 1A) are single cell readings of gene expression data at uniformly-spaced time points, for example from single cell real-time PCR. Let s be the number of samples or individual cells, m be the number of genes, and n be the number of measurement time points.
KS j , Δ = max Fˆ j ,t ( d ) − Fˆ j ,t ( d ) k k+1 k d !!
1
(1)
where the distance KS j , Δt denotes the KS distance of gene j k !! in the time window Δtk, Fˆ j ,t d denotes the cumulative !! k distribution function of gene j constructed using the single cell expression data of gene j at time point tk (k = 1, 2, …, n) and d denotes the (random) gene expression variable. Presently, we consider the scenario where single cell expression data are taken at evenly spaced time points. When
()
2.2 TSNI We compared SNIFS and the use of time series single cell expression data, to a GRN inference algorithm using population-averaged data, called TSNI (Time Series Network Identification) (Bansal et al. 2006). TSNI relies on a linear 2
2016 IFAC FOSBE October 9-12, 2016. Magdeburg, Germany
Nan Papili Gao et al. / IFAC-PapersOnLine 49-26 (2016) 147–152
149
A Single(cell*popula/on*snapshot*data* B*
Gene*1*
Gene*2*
14
12
12
10
.**.**.***
8
t1*
8 6
6 4
4 2
0 7.6
7.64
7.66
7.68
7.7
7.72
7.74
7.76
7.78
7.8
#10
t2*
-3
0 7.45
14
12
10
10
8
6
7.55
7.6
7.65
0 7.5
7.7
7.6
7.65
7.7
7.75
KS ,KS2,Δt ,…,KSm,Δt 1 1 !! 1,Δt1
7.8
#10
-3
.**.**.***
8
6
4
4 2
-1.6
-1.4
-1.2
-1
-0.8 -3
0 -4
-3
-2
-1
0
1
2
3
4
5
0 7.8
6
.**.**.***
10 6
8 6
4
2
3.4
3.6
3.8
#10
0 3.8
8.2
8.3
8.4
8.5
8.6
8.7
#10
-3
KS ,KS2,Δt ,…,KSm,Δt n−1 n−1 !! 1,Δtn−1
12
10
8
6
4
4 2
3.2
8.1
14
16 14 12
8
3
8
.* .* .*
18
10
2.8
7.9
#10 -4
.* .* .*
12
C*
7.55
12
2
-1.8
#10
0 2.6
6
4
10
.* .* .*
t n"
8
#10 -3
6
2
0 -2
7.5
8
4
10
2
2
7.62
14
12
KS distance
Gene*m" 14
12 10
2 4
4.2
4.4
-3
4.6
4.8
5
#10
-3
0 2.4
2.6
2.8
3
3.2
3.4
3.6
#10 -3
D
For each gene j:
⎛ KS j ,Δt2 ⎜ ⎜ KS j ,Δt 3 ⎜ ⎜ ! ⎜ KS ⎜ j ,Δt n−1 !!⎝
⎞ ⎛ KS 1,Δt1 ⎟ ⎜ ⎟ ⎜ KS1,Δt 2 ⎟ =⎜ ⎟ ⎜ ! ⎟ ⎜ KS ⎟⎠ ⎜⎝ 1,Δt n−2
KS2,Δt
KS2,Δt
! KS2,Δt
1 2
n−2
"
KSm,Δt
"
KSm,Δt
" ! " KSm,Δt
1 2
n−2
⎞⎛ α ⎟ ⎜ 1, j ⎟⎜ α ⎟ ⎜ 2, j ⎟⎜ ! ⎟⎜ α ⎟⎠ ⎝ m, j
⎞ ⎟ ⎟ ⎟ ⎟ ⎟ ⎠
LASSO 1 2
α*
1,j
α*2,j
j
.* .* .*
m
Fig. 1. Sparse Network Inference for Single Cell Data. (A) Input data: time-stamped cross-sectional data of gene expression. (B) Step 1: calculation of the Kolmogorov-Smirnov (KS) distance of gene expression distribution over each time step and for each gene. (C) Step 2: formulation of the GRN inference as a linear multivariate regression problem. (D) Step 3: Lasso regression algorithm to find sparse solution.
dynamical system modelling of the gene transcriptional process:
x! (t )= Ax(t k )+ Bu(t k ) !! k
expression profiles of a target gene are estimated from the expression profiles of the other genes. GENIE3 was the best performer in DREAM4 in silico network inference challenge using multifactorial perturbation data, and was also among the top performers of DREAM5 challenge (Marbach et al. 2009; Marbach et al. 2012). Recently, Ocone et al. applied GENIE3 to single cell expression data to obtain the skeleton of the GRN (Ocone et al. 2015). Based on this skeleton and a pseudo-time ordering of the cells using diffusion map, they constructed a dynamic model for the final inference step. We faced difficulties in generating the pseudo-time ordering of cells for the data in the case study, since the data did not cover the entire dynamic trajectories of the cell state. Nevertheless, GENIE3 could still be employed and the outputs of GENIE3 were directly comparable with those of SNIFS.
(4)
describing the time evolution of the (average) concentration of mRNA x(t) whose expression is regulated by various regulator genes as stipulated in the GRN matrix A, and is also affected by the perturbations B and the inputs u. For the inference of the GRN, TSNI proceeds with the discrete time version of the above model, given by:
x k+1 = ⎡ A d ⎣ !!
⎡ x Bd ⎤ ⎢ k ⎦⎢ u ⎣
⎤ ⎥ ⎥⎦
(5)
where the subscripts d indicate the discrete time matrices. The inference of Ad follows a three step procedure, involving (i) time series data smoothing, (ii) generation of artificial time series data by interpolation, such that the number of time points is larger than the number of genes, and (iii) solving the linear discrete model for Ad and Bd in a lower dimensional space by using principal component analysis. Finally, TSNI calculates the continuous time GRN matrix A from the inferred Ad using a bilinear transformation.
2.4 JUMP3 JUMP3 is an extension of GENIE3 to time series data. The algorithm combines the ON/OFF model of gene expression with non-parametric extra-trees approach, resulting in a hybrid strategy between model-based and model-free methods (Huynh-Thu and Sanguinetti 2015). In particular, to reconstruct the network topology, JUMP3 uses time series gene expression data to generate an ensemble of decision trees for each target gene j. From the tree-based model, the algorithm computes the importance score for possible regulators of a target gene j, i.e. genes capable of predicting the binary promoter state of the target gene j. These confidence scores provide the weights for directed regulatory edges of the predicted network.
2.3 GENIE3 GENIE3 is an algorithm capable of inferring GRNs using tree-based ensemble method, including random forests and extra-trees (Huynh-Thu et al. 2010). GENIE3 decomposes the inference of GRN among m genes into m independent regression problems, where in each regression problem, the
2.5 In silico data generation 3
2016 IFAC FOSBE 150 October 9-12, 2016. Magdeburg, GermanyNan Papili Gao et al. / IFAC-PapersOnLine 49-26 (2016) 147–152
GNW
A
B
t1#
Predicted Adjacency Matrix
D ⎛ ⎜ ⎜ ⎜ ˆ A=⎜ ⎜ ⎜ ⎜ ⎜⎝ !!
0 0 0 0 0 " 0
X 0 X 0 X " 0
0 0 0 0 X " 0
X 0 X 0 X " 0
0 0 0 0 0 " 0
! ! ! ! ! # !
0 0 0 0 0 " 0
⎞ ⎟ ⎟ ⎟ ⎧ X ⎟ with aˆ = ⎪ ⎨ ij ⎟ ⎪ ⎩ 0 ⎟ ⎟ ⎟⎠
sectional data of the gene expression distribution. In total, we generated single cell expression data for 8 equally spaced time points between t = 0.1 and t = 2, allowing the gene expression dynamics to reach steady state.
In silico Single-Cell Time Series Data
if i regulates j otherwise
t2#
C
.##.##.###
We evaluated the quality of the inferred GRN by computing the areas under the Receiver Operating Characteristic (AUROC) and the precision-recall (AUPR) curve (Marbach et al. 2010). More specifically, from the estimated α*’s, we first ranked edges (gene regulations) based on the magnitudes of αij coefficients. Subsequently, we computed the numbers of true positive (TP), true negative (TN), false positive (FP) and false negative (FN) predictions as we included more edges starting from the highest ranked edge. Based on these numbers, we constructed the receiver operating characteristic curve, which is a plot of the true positive rates (TPR = TP/(TP+FN)) versus the false positive rates (FPR = FP/(FP+TN)). The precision (TP/(TP+FP)) and recall (TP/(TP+FN)) curves were also created. Finally, we evaluated the area under each of these curves. Here, higher AUROC and AUPR indicate more accurate GRN predictions.
t n"
GENE REGULATORY NETWORK INFERENCE ALGORITHM
Fig. 2. Workflow for performance evaluation of SNIFS, TSNI, GENIE3, and JUMP3. (A) In silico network generation using GeneNetWeaver. (B) Time- stamped cross-sectional single cell expression data generation. (C) GRN inference by SNIFS, TSNI and GENIE3. (D) The predicted GRN matrix with a non-zero element αij indicates the presence of a regulation (activation or inhibition) by gene i on gene j.
In order to assess the performance of SNIFS, we generated in silico single cell expression datasets. As illustrated in Fig. 2, we first obtained random 10-gene and 20-gene subnetworks of Escherichia coli and Saccharomyces cerevisiae (yeast) GRNs using GeneNetWeaver (GNW), the same benchmark network generator used in DREAM4 in silico network inference challenge (Marbach et al. 2009). For the case study, we further remove self-regulatory feedbacks from these subnetworks. Given the structure of the GRN, we simulated synthetic single cell expression data using a stochastic differential equation (SDE) model of the mRNA, according to: (Pinna et al. 2010)
⎛ n ⎛ ⎞ xi ⎞ dx j = V ⎜ β ∏ ⎜ 1 + α i , j − θ x j ⎟ dt + σ x j dW(t ) ⎟ xi + 1 ⎠ ⎝ i=1 ⎝ ⎠
3. RESULTS For testing the performance of SNIFS, we generated 20 subnetworks of E. coli GRNs, out of which 10 were 10-gene GRNs and another 10 were 20-gene GRNs. We also generated 20 subnetworks of yeast GRNs in the same composition. For each subnetwork, we simulated single cell gene expression data of 100 cells and for 8 time points, following the procedure described above. We applied SNIFS to these datasets using two different settings for selecting the optimal λ value, i.e. using either λMIN or λSE. In addition, we also obtained GRN predictions from TSNI and JUMP3 using the average values of the gene expression data (over 100 cells), and from GENIE3.
(6)
!!
Table 1 reports the AUROC and AUPR values of the inferred GRNs from TSNI, GENIE3 (with extra trees option), JUMP3 and SNIFS. For 10-gene subnetworks, SNIFS using both strategies significantly outperformed TSNI, GENIE3 and JUMP3 in AUROC and AUPR (p-value < 0.05). For 20-gene subnetworks, SNIFS also produced much more accurate GRN predictions than TSNI, GENIE3, and JUMP3 based on AUROC and AUPR (p-value < 0.05). Notably, the AUPR values from SNIFS were much higher than those from the other inference methods. SNIFS employing λMIN and λSE did not differ significantly. As expected, larger networks were more difficult to infer, especially in AUPR values.
where xj represents the mRNA level of gene j, αi,j describes the regulation of the expression of gene j by gene i, β denotes the basal transcriptional rate, θ is the mRNA degradation rate constant, and σ and V are scaling parameters. The term dW(t) describes the random Wiener process, used to simulate intrinsic stochastic dynamics of gene expression (Wilkinson 2009). We set αij to 1 for activation, to −1 for repression, and to 0 otherwise. For the main datasets in the case study, we set the parameters to the following: V=30, β =1, θ=0.2, and σ=0.1. We integrated the SDE model above using the EulerMaruyama method (Higham. 2001) with initial conditions xj(0) set equal to 0. As shown in Fig. 2B, for each GRN structure, we simulated cross-sectional single cell dataset consisting of 100 cells per time point, a number that is consistent with Fluidigm Biomark® platform (Pieprzyk and High 2009). In order to imitate the real life scenario where cells are lysed before the gene expression assay (Gao et al. 2011; Moignard et al. 2015), we simulated independent batches of 100 SDE simulations for different time points. The simulated datasets thus represent time-stamped cross-
In order to test the robustness of these methods, we further generated single cell expression data using different σ parameters and different numbers of time points. Increasing the value of σ means higher intrinsic biological noise. Table 2 gives the average AUROCs and AUPRs over the 10-gene E. coli and yeast subnetworks. The performance of SNIFS dropped with increasing biological stochasticity, while that of TSNI did not. This trend is expected since in the SDE model (6), biological noise acted only as a nuisance variable. TSNI was less affected by increasing noise because the method 4
2016 IFAC FOSBE October 9-12, 2016. Magdeburg, Germany Nan Papili Gao et al. / IFAC-PapersOnLine 49-26 (2016) 147–152
151
Table 1. Evaluation of GRN Inference using TSNI, GENIE3, JUMP3 and SNIFS
employed population averages as well as data smoothing. Nevertheless, SNIFS performs better than TSNI in terms of AUPR. Meanwhile, the number of time points used did not significantly affect the performance of the methods. However, since SNIFS involves a cross validation step using LOOCV, the number of time points should not be too low. In such a scenario, we recommend again to employ interpolation to generate artificial time points of KS distances.
regularization. For this reason, the GRN inference in SNIFS can be easily parallelized and performed efficiently. We compared SNIFS to a time-series network inference method TSNI since the two methods share essentially the same formulation in inferring the GRNs. In particular, one can view the linear model in (2) as a discrete time model in (5), for KS distributional distances. In contrast to SNIFS, TSNI did not make any assumption on network sparsity. By comparing SNIFS to TSNI, we demonstrated that using the full distributional information could lead to significantly more accurate GRN predictions than using only the population-averaged data. While the performance of SNIFS was better than TSNI, GENIE3, and JUMP3, the AUROCs of the GRN predictions were rather low, especially for 20-gene networks. However, the low AUROCs may arise from network inferability issue, a well known problem in GRN inference (Szederkényi et al. 2011; Ud-Dean and Gunawan 2014). To the best of our knowledge, assessing network inferability from single cell data has not yet been addressed in the literature, nor has the optimal experimental design to alleviate such an issue.
Table 2. Performance of SNIFS on datasets with different biological noise and time points
σ 0.1 0.2 0.3 0.4 Time/points 10 9 8 7 6
TSNI 0.52 0.52 0.52 0.52 0.50 0.50 0.52 0.53 0.54
10-GENE NETWORK AUROC Lambda_min Lambda_1se TSNI A 0.62 0.54 0.13 0.60 0.55 0.13 0.54 0.54 0.13 0.51 0.51 0.13 B 0.62 0.53 0.12 0.61 0.50 0.12 0.62 0.54 0.13 0.60 0.51 0.13 0.59 0.54 0.13
AUPR Lambda_min Lambda_1se 0.34 0.29 0.21 0.20
0.29 0.26 0.23 0.21
0.34 0.32 0.34 0.31 0.34
0.30 0.23 0.29 0.25 0.31
4. DISCUSSIONS
Despite its advantages over existing strategies, SNIFS has several limitations. First, the method generates predictions only for the topology of GRNs without any indications on the modes of the gene regulations, i.e. whether the regulation is activating or repressing. We are presently working toward a new method, which uses the output of SNIFS as a skeleton of the GRN in a subsequent step. In addition, the input dataset for SNIFS need to have uniform time mesh. As mentioned in Section 2.1, one can interpolate KS distances from uneven time steps to estimate these distances for uniform time mesh. Interpolation of KS distances may also be necessary when the number of time points is low, for cross-validation purposes. Despite these limitations, SNIFS represents a promising and efficient strategy for extracting insights in the form of network structure/topology from single cell data. The formulation based on a sparse linear regression model and distributional distances is general enough to handle not only
In this work, we developed SNIFS for inferring gene regulatory network structures from time-stamped crosssectional single cell expression datasets. While longitudinal expression data of single cells should carry more information than cross-sectional data, the present technology for such measurements (e.g. using fluorescence proteins) is still limited to only a handful of genes. In contrast, large-scale cross-sectional single cell data of mRNA expression have been reported in recent literature (Guo et al. 2010; Moignard et al. 2013; Kouno et al. 2013). SNIFS relies on a linear model describing the time evolution of distributional distances governed by the GRN. The network inference is formulated as finding a sparse solution to a linear regression problem independently for each gene, employing the Lasso
5
2016 IFAC FOSBE 152 October 9-12, 2016. Magdeburg, GermanyNan Papili Gao et al. / IFAC-PapersOnLine 49-26 (2016) 147–152
single cell gene expressions, but also other single cell quantitative data such as protein amounts from mass cytometry (Bendall et al. 2011).
Marbach, D., T. Schaffter, C. Mattiussi, and D. Floreano. (2009). Generating realistic in silico gene networks for performance assessment of reverse engineering methods. J. Comput. Biol. 16 (2), 229–39. Massey, F. J. (1951). The Kolmogorov-Smirnov Test for Goodness of Fit. J. Am. Stat. Assoc. 46 (253), 68 – 78. Moignard, V., I. C. Macaulay, G. Swiers, F. Buettner, J. Schütte, F. J. Calero-Nieto, S. Kinston, et al. (2013). Characterization of transcriptional networks in blood stem and progenitor cells using high-throughput singlecell gene expression analysis. Nat. Cell Biol. 15 (4), 363–72. Moignard, V., S. Woodhouse, L. Haghverdi, A. J. Lilly, Y. Tanaka, A. C. Wilkinson, F. Buettner, et al. (2015). Decoding the regulatory network of early blood development from single-cell gene expression measurements. Nat. Biotechnol. 33(3), 269–276. Ocone, a., L. Haghverdi, N. S. Mueller, and F. J. Theis. (2015). Reconstructing gene regulatory dynamics from high-dimensional single-cell snapshot data. Bioinformatics 31 (12), 89–96. Pieprzyk, M., and H. High. (2009). Fluidigm Dynamic Arrays provide a platform for single-cell gene expression analysis. Nat Methods 6 (7). Pina, C., J. Teles, C. Fugazza, G. May, D. Wang, Y. Guo, S. Soneji, et al. (2015). Single-Cell Network Analysis Identifies DDIT3 as a Nodal Lineage Regulator in Hematopoiesis. Cell Rep. 11 (10), 1503–10. Pinna, A., N. Soranzo, and A. de la Fuente. (2010). From knockouts to networks: establishing direct cause-effect relationships through graph analysis. PloS One 5 (10): e12912. Sandberg, R. (2013). Entering the era of single-cell transcriptomics in biology and medicine. Nat Methods 11 (1), 22–24. Stegle, O., S. A. Teichmann, and J. C. Marioni. (2015). Computational and analytical challenges in single-cell transcriptomics. Nat. Rev. Genet. 16 (3), 133–45. Szederkényi, G., J. R. Banga, and A. A. Alonso. (2011). Inference of complex biological networks: distinguishability issues and optimization-based solutions. BMC Syst. Biol. 5 (1), 177. Teles, J., C. Pina, P. Edén, M. Ohlsson, T. Enver, and C. Peterson. (2013). Transcriptional regulation of lineage commitment--a stochastic model of cell fate decisions. PloS Comput. Biol. 9 (8), e1003197. Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J Royal. Statist. Soc B., Vol. 58, No.1. Ud-Dean, S. M. M., and R. Gunawan. (2014). Ensemble inference and inferability of gene regulatory networks. PloS One 9 (8), e103812. Wilkinson, D. J. (2009). Stochastic modelling for quantitative description of heterogeneous biological systems. Nat. Rev. Genet. 10 (2), 122–33. Woodhouse, S., V. Moignard, B. Göttgens, and J. Fisher. (2016). Processing, visualising and reconstructing network models from single-cell data. Immunol. Cell Biol. 94 (3), 256–65.
REFERENCES Bansal, M., G. Della Gatta, and D. Di Bernardo. (2006). Inference of gene regulatory networks and compound mode of action from time course gene expression profiles. Bioinformatics 22 (7), 815–822. Bendall, S. C., E. F. Simonds, P. Qiu, E. D. Amir, P. O. Krutzik, R. Finck, R. V Bruggner, et al. (2011). Singlecell mass cytometry of differential immune and drug responses across a human hematopoietic continuum. Science 332 (6030), 687–96. Chen, H., J. Guo, S. K. Mishra, P. Robson, M. Niranjan, and J. Zheng. (2014). Single-cell transcriptional analysis to uncover regulatory circuits driving cell fate decisions in early mouse development. Bioinformatics 31 (7), 1060–1066. Chen, T., V. Filkov, and S. S. Skiena. (1999). Identifying gene regulatory networks from experimental data. In Proc. third Annu. Int. Conf. Comput. Mol. Biol. RECOMB '99, 94–103. New York, New York, USA, ACM Press. Friedman, J., T. Hastie, and R. Tibshirani. (2010). Regularization Paths for Generalized Linear Models via Coordinate Descent. J. Stat. Softw. 33 (1), 1–22. Gao, W., W. Zhang, and D. R. Meldrum. (2011). RT-qPCR based quantitative analysis of gene expression in single bacterial cells. J. Microbiol. Methods 85 (3), 221–7. Guo, G., M. Huss, G. Q. Tong, C. Wang, L. Li Sun, N. D. Clarke, and P. Robson. (2010). Resolution of cell fate decisions revealed by single-cell gene expression analysis from zygote to blastocyst. Dev. Cell18 (4), 675–85. Higham., D. J. (2001). An Algorithmic Introduction to Numerical Simulation of Stochastic Differential Equations. SIAM Rev. 43 (3), 525–546. Huynh-Thu, V. A., A. Irrthum, L. Wehenkel, and P. Geurts. (2010). Inferring regulatory networks from expression data using tree-based methods. PloS One 5 (9), e12776. Huynh-Thu, V. A., and G. Sanguinetti. (2015). Combining tree-based and dynamical systems for the inference of gene regulatory networks. Bioinformatics 31 (10), 1614–22. Kouno, T., M. de Hoon, J. C. Mar, Y. Tomaru, M. Kawano, P. Carninci, H. Suzuki, Y. Hayashizaki, and J. W. Shin. (2013). Temporal dynamics and transcriptional control using single-cell gene expression analysis. Genome Biol. 14 (10), R118. Marbach, D., J. C. Costello, R. Küffner, N. M. Vega, R. J. Prill, D. M. Camacho, K. R. Allison, M. Kellis, J. J. Collins, and G. Stolovitzky. (2012). Wisdom of crowds for robust gene network inference. Nat. Methods 9 (8), 796–804. Marbach, D., R. J. Prill, T. Schaffter, C. Mattiussi, D. Floreano, and G. Stolovitzky. (2010). Revealing strengths and weaknesses of methods for gene network inference. Proc. Natl. Acad. Sci. U. S. A. 107 (14), 6286–91. 6