Evaluation of artificial time series microarray data for dynamic gene regulatory network inference

Evaluation of artificial time series microarray data for dynamic gene regulatory network inference

Accepted Manuscript Evaluation of artificial time series microarray data for dynamic gene regulatory network inference P. Xenitidis , I. Seimenis , S...

1MB Sizes 0 Downloads 41 Views

Accepted Manuscript

Evaluation of artificial time series microarray data for dynamic gene regulatory network inference P. Xenitidis , I. Seimenis , S. Kakolyris , A. Adamopoulos PII: DOI: Reference:

S0022-5193(17)30218-7 10.1016/j.jtbi.2017.05.010 YJTBI 9067

To appear in:

Journal of Theoretical Biology

Received date: Revised date: Accepted date:

25 July 2016 13 March 2017 5 May 2017

Please cite this article as: P. Xenitidis , I. Seimenis , S. Kakolyris , A. Adamopoulos , Evaluation of artificial time series microarray data for dynamic gene regulatory network inference, Journal of Theoretical Biology (2017), doi: 10.1016/j.jtbi.2017.05.010

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

ACCEPTED MANUSCRIPT

Highlights The influence of a number of factors on the inference of a gene regulatory network using microarray data was examined.  The amount of information in microarray data was evaluated using a system theory approach.  The relationship between the inference performance in the time domain and the true system parameter identification was investigated. .  Crucial factors were examined using a real GRN and acquired results confirmed simulation findings with artificial data.  Different initial conditions were used as an alternative triggering approach.

AC

CE

PT

ED

M

AN US

CR IP T



ACCEPTED MANUSCRIPT

Evaluation of artificial time series microarray data for dynamic gene regulatory network inference P. Xenitidis1, I. Seimenis1, S. Kakolyris2 and A. Adamopoulos1 1

Medical Physics Laboratory, School of Medicine, Democritus University of Thrace, Alexandroupolis, Greece Department of Medical Oncology, School of Medicine, Democritus University of Thrace, Alexandroupolis, Greece

explicitly in a GRN. Some are acting as hidden variables, but nevertheless their influence is taken into account. A GRN is a model that is trying to describe the communication between genes including all regulatory processes giving a description of the cellular regulation (Das, 2010; Hecker et al., 2009, Hitoshi Iba, 2016; Penfold and Wild, 2011) Inference of a GRN is the process where experimental data are used in order to identify how genes interact with each other. A network is fully described when the network structure and the network parameters are known. The network structure provides information about the existence of an interaction between two genes and is graphically represented by a straight line (Hecker et al., 2009; Penfold and Wild, 2011).

AN US

Abstract— High-throughput technology like microarrays is widely used in the inference of gene regulatory networks (GRNs). We focused on time series data since we are interested in the dynamics of GRNs and the identification of dynamic networks. We evaluated the amount of information that exists in artificial time series microarray data and the ability of an inference process to produce accurate models based on them. We used dynamic artificial gene regulatory networks in order to create artificial microarray data. Key features that characterize microarray data such as the time separation of directly triggered genes, the percentage of directly triggered genes and the triggering function type were altered in order to reveal the limits that are imposed by the nature of microarray data on the inference process. We examined the effect of various factors on the inference performance such as the network size, the presence of noise in microarray data, and the network sparseness. We used a system theory approach and examined the relationship between the pole placement of the inferred system and the inference performance. We examined the relationship between the inference performance in the time domain and the true system parameter identification. Simulation results indicated that time separation and the percentage of directly triggered genes are crucial factors. Also, network sparseness, the triggering function type and noise in input data affect the inference performance. When two factors were simultaneously varied, it was found that variation of one parameter significantly affects the dynamic response of the other. Crucial factors were also examined using a real GRN and acquired results confirmed simulation findings with artificial data. Different initial conditions were also used as an alternative triggering approach. Relevant results confirmed that the number of datasets constitutes the most significant parameter with regard to the inference performance.

CR IP T

2

(a)

a 41

M

ED

PT

CE

Keywords— Gene regulatory networks, inference, microarray data.

AC

I. INTRODUCTION

Gene regulation is a process that expands on multiple spatial and temporal levels. Gene activities include mRNA transcription, mRNA translation to proteins and transcription factors. Regulatory interactions also include activities such as protein and transcript degradation, protein posttranslational modification, signal-transduction, metabolism and microRNA regulation. All activities are not included

a 21

gene 1

gene 2

a 32

a13

a 24

gene 4

gene 3

aˆ 12

(b) gene 1

aˆ 14

gene 2

aˆ 13 aˆ 41

aˆ 31 aˆ 21

aˆ 42

aˆ 32 aˆ 23

aˆ 24

aˆ 34 gene 4

aˆ 43

gene 3

Fig. 1. (a) The original network and, (b) the inferred network. In this example, the original network consists of four genes and five edges. The inferred network is a fully connected network.

A GRN can be represented by a connectivity matrix. Each coefficient aij indicates the strength of the influence of gene j on gene i. The sign of a coefficient describes the nature of the interaction, indicating an activating (positive sign) or inhibitory (negative sign) relationship between

ACCEPTED MANUSCRIPT

(a) Input

Unknown system

Output

(b) Gene expressions

M

External perturbation

Fig. 2. (a) The system identification process: The system is triggered by an

ED

input and its response (output) is measured. (b) Microarray experiment: External perturbation is applied on a biological system. The expression levels of the genes are measured.

CE

PT

Several mathematical models for the GRN inference have been used in the past, such as: Boolean networks, linear and nonlinear differential equations, Bayesian networks, dynamic Bayesian networks, Petri nets, cellular automata, neural networks, fuzzy logic models, information theory models and graphical Gaussian models (Hecker 2009; Ristevski 2013). There were two key points in the choice of the inference algorithm and the artificial data generation model. The first point is that we are interested in dynamical models that involve time-series training data and have the ability to capture the system’s dynamics in the time domain. The second point is that we are interested in simple linear models. Our main purpose is to not the identification of nonlinear behavior but we are rather focused on the amount of the information that exists in microarray data. Having the above two points in mind, we decided to use ordinary differential

AC

CR IP T

equations and specifically the Time Series Network Identification (TSNI) algorithm proposed by Bansal (Bansal et al., 2006). In our study we used artificial time-series microarray data. The necessity of using artificial data instead of real life microarray data is supported by the fact that the evaluation of an inference algorithm and subsequently the evaluation of the information in microarray data require the comparison between the inferred and the real system. In a real microarray experiment the biological system under investigation is totally or partially unknown. The use of a model for the creation of artificial data is helping us in two ways. Firstly, it allows us to access the original system and calculate the inference performance. Secondly, it allows us to create various triggering scenarios and explore the effect of crucial factors on the inference performance. The Precision-Recall curve (P-R) (Fawcett, 2006) and the area under the P-R curve (AUPR) were used for performance evaluation. Relatively low values of the AURP in simulations using artificial microarray data motivated us to investigate which factors affect the inference performance. The comparison of a GRN inference with other classical system identification schemes (Ljung, 1999) reveals certain particularities in the way that a GRN is excited in a microarray experiment. This fact helped us to choose three potentially crucial factors, the time separation of directly triggered genes, the percentage of directly triggered genes and the type of excitation signal. Two additional factors, the network size and the network sparseness were also examined. The combining effect of two of the above mentioned factors was also examined by the use of three dimensional (3D) plots. We also tested how various triggering scenarios affect a real GRN by using the SOS DNA repair network in bacterium Escherichia coli. Finally, datasets with different initial parameters were created and the inference performance was comparatively evaluated. The ultimate goal of this study was to evaluate the adequacy of artificial microarray data to support the inference process and to highlight those factors which crucially affect its outcome, since traditional experimental setups may produce datasets with inadequate information for accurate network identification.

AN US

genes. A simple GRN and the inferred network are shown in Fig. 1. In a system identification process, the system under investigation is triggered with an appropriate input and its response (output) is measured (Fig. 2a). An identification algorithm is applied on the input-output dataset and, a model that describes the relationship between input and output is constructed. Microarray data are often used in the identification of GRN due to their trigger-response nature. In microarray experiments an external perturbation is applied and the gene expression levels are measured (Fig. 2b). Microarray experiments are usually classified into two categories: Static and time-series experiments. In a static experiment, only a single time point of the gene expression is measured at the steady state of the biological system. In the time-series experiments, which are of interest in this study, the biological system is triggered and the gene expression levels are measured over time (Bar-Joseph, 2004; Storey et al., 2005).

II.

METHODOLOGY

The basic steps of the methodology followed are shown in Fig. 3. Firstly, a dynamic GRN model was constructed, based on ordinary differential equations, described by Eq. (1). 

X(tk ) = A X(tk ) + B U(tk )

k = 1 ... M

(1)

ACCEPTED MANUSCRIPT

Inference algortihm

Artificial expression time-series data

Inferred GRN

A Inferred connectivity matrix

Inferred GRN evaluation

PR curves

AN US

A connectivity matrix

Artificial GRN simulation

M

Artificial GRN creation

ED

Fig. 3. The basic steps for the artificial data generation, the inference process and the performance evaluation.

CE

PT

In a typical GRN only a percentage of the genes are directly affected by the external perturbation. This information is presented by the use of the perturbation matrix B. The rest of the genes are indirectly triggered through the connectivity matrix A. Each gene expression level is estimated by adding the direct and the indirect part of the stimulation. A step function is used to model the external perturbation. The directly triggered genes are concurrently triggered as shown in Fig. 4a. After the artificial data creation, we applied the inference algorithm (TSNI) and created the inferred system that is described by the estimated versions of matrices A and B. Finally, the inferred system expression levels were produced through simulation. It must be noted that the TSNI algorithm converts the ordinary differential equations to their discrete form. Then, the system is solved by the use of a pseudo inverse matrix.

AC

All models and simulations were implemented using Matlab (student version 2016a). At first we tested a number of artificial data that were created by the use of the concurrent triggering scenario, shown in Fig. 4a, which resembles real life microarray data. It was noticed that a perfect network recovery (i.e. an accurate inference) could not be achieved. The resulting inference performance was in agreement with the results of Bansal (Bansal et al., 2006). Although the inference algorithm provided models with good matching in the time domain, the matching regarding the connectivity matrix A was far from acceptable. A perfect recovery is considered to be achieved when the estimated connectivity matrix Aest is equal to the original connectivity matrix A. Two possible reasons can lead in poor inference performance. The first reason is low data quality. This means that the training dataset does not contain all the necessary information in order to enable the inference algorithm to achieve a good performance. The second reason is an inherent inability of the inference algorithm to reach a certain level of performance. It was noticed that the triggering scenario used by an inference process that is based on microarray, differentiates from other classical system identification triggering scenarios. Three of the key features that characterize microarray data are the time separation of directly triggered genes, the percentage of directly triggered genes and the type of the triggering function. We decided to adjust these features and practically move away from the real microarray triggering scenario in order to investigate if they act as limiting factors. The first feature that was altered was the time separation of directly triggered genes. A time separation coefficient is defined by the use of (Eq. 2).

CR IP T

where X(tk) is the concentration vector, A is the connectivity matrix, B is the matrix that represents the effect of the perturbation, U is the perturbation vector at time t k and M is the number of sample points (Bansal et al., 2006). In the original TSNI algorithm implementation, it was supported that since the number of time points are less than the number of genes, the system cannot be solved directly and both interpolation and dimensional reduction techniques were used to resolve this problem. In this work, neither interpolation nor dimensional reduction was used since we strongly believe that the key feature for successful identification is not the data dimension but the quantity of information that is contained in the dataset. Additional information on the inference algorithm employed in this study can be found in the Appendix.

Time separation coefficient =

Time separation % (2) Triggering pulse duration

According to the above, a concurrent triggering scenario corresponds to coefficient equal to 0%. This scenario is shown in Fig. 4a. Values between 0% and 100% denote that there is a time overlap between successive pulses. This scenario is shown in Fig. 4b. A successive triggering scenario where each triggering pulse begins immediately after the seizing of the previous pulse corresponds to a coefficient equal to 100%. This scenario is shown in Fig. 4c. Finally, values greater than 100% denote that there is no overlap between successive pulses. A scenario with a 200% time separation coefficient is shown in Fig. 4d. In classical system identification schemes, systems are triggered in such a way that all possible states of the system are stimulated, leading to an inference dataset that contain

ACCEPTED MANUSCRIPT

Gene 1

Perturbation 2

(a) time

(b)

time Gene 2

time

Gene 2

M

(c)

Time seperation

(d)

time

Fig. 4. Four triggering scenarios. (a) 0% time separation, (b) 50% time

PT

separation, (c) 100% time separation (d) 200% time separation between triggering pulses. Each box corresponds to a gene and depicts the relevant triggering time series. (2 genes are shown).

CE

gene is expressed or “turned on” for a specific time period. We replaced the step function with Gaussian noise. It is obvious that triggering a system with the above scenario cannot be applied to biological systems due to limitations in the cells response time. Nevertheless, it can provide valuable information about the inference process. This noise triggering scenario is shown in Fig. 5. In system identification there are several input signals that are commonly used such as the step function, the square function, Gaussian noise and the random binary signal (RBS), (Ljung, 1999; Katayama, 2005; Castilla, 2014).

AC

Perturbation 1 Perturbation 2

Gene 2

time

Fig. 5. Noise as triggering function for 100% time separation. Each box

A factor that is important and characterizes the input signal is the persistence of excitation. It is connected to the power spectral density of the input signal. It refers to the richness of information and the ability to excite all the “modes” of the system. The condition of a persistently exciting input ensures that the output captures the complete dynamic behavior of the system (Doraiswami et al., 2014). The persistence of excitation varies among various input signals (Zhu, 2001; Nelles 2001). The reason that we used the Gaussian noise as an alternative input signal is the fact that it has a wider spectrum and a higher persistence of excitation compared to the step function. Two points are critical for correct system identification (Castilla, 2014): the choice of identification data and the model order. The selection of an adequate set of inputoutput signal allows the identification process to be free from constant systematic error. In control theory, a step function is often used as a triggering function because the identification process is based on the reaction curve. A step function is not always appropriate for the identification of all systems. Often an excitation signal with a wide frequency spectrum and consequently a higher persistence of excitation is needed (Castilla, 2014). Microarray experiments are forcing a step function type system. This fact is suspected to lead in poor identification performance. Apart from the input signal function, we also examined the influence of other factors on the inference performance such as the network size, the network sparseness and the effect of noise present in input data. Afterwards, the connection between the inference performance and the pole placement of inferred system was examined. Finally, we examined the inferred network performance in the time domain.

AN US

Gene 2

ED

Perturbation 1 Perturbation 2 Perturbation 2

Gene 2

Time seperation

time

corresponds to a gene and depicts the relevant triggering time series (2 genes are shown). The pulse function was replaced by Gaussian noise.

time

Time seperation

Gene 1

CR IP T

Perturbation 1

all the necessary information. The three key features, discussed above, are suspected to drive us away that direction. The second feature that was altered was the percentage of directly triggered genes. In a typical GRN the external perturbation is not directly affecting all the genes that are present. A percentage of the participating genes are affected. The third feature that was altered was the type of the triggering function. The system triggering in a microarray experiment is described by a step function. This denotes that when the external perturbation is applied a directly triggered

A. Network size GRN size is typically tens of genes (Segal et al., 2007) and varies in different organisms. In E. coli there are 171

ACCEPTED MANUSCRIPT

parameter and examined how it affects the inference performance. The inference performance was evaluated by the use of the P-R curves. C. Time separation coefficient.

CR IP T

The network size was set to 20 genes. We used a fixed percentage of directly triggered genes equal to 100%. We varied the time separation coefficient in a range from 200% to 0% and examined its influence on the inference performance. The inference performance was evaluated by the use of the P-R curves. D. Network Sparseness.

Biological networks, including GRNs, are known to be sparse and made up of hubs with different degree of connectivity (Ivanov et al., 2009). We examined the influence of the network sparseness on the inference performance. The network size was set to 20 genes. The number of directly triggered genes was varied from 2 to 20. The pulse function triggering was used. The time separation coefficient was equal to 200%. Two types of networks were used. A fully connected network with sparseness equal to 0% and a sparse network with sparseness equal to 50%. The inference performance was evaluated by the use of the P-R curves.

PT

ED

M

AN US

transcription factors and approximately 2000 in mammals (Sauro 2011). In addition, when a specific biological process is considered, the number of transcription factors involved significantly decreases. (Chen 2009). We firstly tried to investigate the effect of the network size on the behavior of artificial networks using various triggering scenarios. Afterwards, we focused on a certain network size (size =20), considering it as representative and went into a more detailed examination. We examined artificial networks with various network sizes, consisting of 10, 20, 50 and 100 genes. We made each network sparse by setting a percentage of the coefficients equal to zero. The network sparseness was arbitrary set to 50%. System stability was forced by imposing the requirements that all poles of the discrete systems are located on the left side of the pole-zero plots. Stability was also checked in the time domain by ensuring that the system time response didn’t include oscillations or unbounded behavior. Firstly, we used the two key factors that correspond to the triggering scenario of Bansal (Bansal et al., 2006). A time separation coefficient equal to 0% and a value of 10% percentage of directly triggered genes. Using Bansal’s two key factors as a reference point, we changed them successively in order to achieve a better performance. We changed the time separation coefficient from 0% to 200%. This was decided to be our first step because the time separation coefficient was proved to be a highly constraining factor and values near zero lead to high performance degradation. Afterward we varied the number of directly triggered genes using the following values: 10%, 50%, 70%, and, finally, 100%. The performance of the algorithm was evaluated using the Precision-Recall curve (P-R curve) (Fawcett, 2006). For each network size, 10 independent computer experiments were performed and the area under the P-R curve was evaluated. For every network size, the mean value of the areas under the P-R curve was used as a performance index.

E. Additive noise on pulse stimulation We examined the influence of noise on the inference performance in two types of networks: (a) a fully connected network with a fixed value of sparseness equal to 0% and (b) a sparse network with a fixed of sparseness equal to 50%. The network size was set to 20 genes. We used a fixed time separation coefficient equal to 200% and the percentage of directly triggered genes was varied between 2 to 20. The inference performance was evaluated by the use of the P-R curves.

CE

B. Percentage of directly triggered genes.

AC

Having obtained results of the influence of the network size on the inference performance, as a second step, the network size was locked to 20 and we explored the effect of the percentage of the directly triggered genes on the inference performance. The network sparseness was arbitrary set to 50%. Two triggering scenarios were examined. According to the first triggering scenario the time separation coefficient value was set to 200%, whereas, following the second triggering scenario, the time separation coefficient value was set to 0%. We used the percentage of directly triggered genes as a

F. Noise as triggering function. We changed the type of the triggering function. |The pulse function was replaced by Gaussian noise, as shown in Fig. 5, and we examined the behavior of the inference process. The inference performance was evaluated by the use of the P-R curves.

ACCEPTED MANUSCRIPT

We focused on the algorithm that performs the transformation of the discrete inferred system to the continuous system. The TSNI algorithm converts the continuous system that uses differential equations to a discrete system that uses difference equations. It solves the system using the pseudoinverse matrix and converts the identified discrete system back to a continuous system. The discrete to continuous system conversion can be accomplished by the use of various techniques such as the bilinear transformation (Tustin’s method), the zero order hold (zoh method) and the first order method (foh method) (Franklin and Powell, 1980; Yang, 2009). The influence of the discrete to analog conversion algorithm on the inference performance was investigated by the use of the P-R curves.

AN US

H. Inferred system poles position.

work. For the ith gene, this is done by setting the elements of line (i) and column (i) in the connectivity matrix of the discrete system, equal to zero. Deleting a gene from a connectivity matrix simply stops the gene contribution to the produced expression levels regarding both the direct and the indirect way. The case of deleting a gene is then compared to the case of diminishing the percentage of directly triggered genes. The comparison is performed in the pole domain. The loss of information in the training dataset is comparable. A system’s response is determined by the location of its zeros and poles (Wescott, 2006). Examples of different pole positions and the corresponding system response can be found in (Levine, 1996 and Tan and Jiang, 2013). It should be noticed though, that the state space model representation used by the TSNI algorithm differentiates from a classical state space model. The system’s order no longer correspond to the number of past input steps that are involved in calculating the present output (Levine, 1996). It corresponds to the number of genes that are involved in the calculation of the output. As a consequence the pole placement in a pole plot provides information about the contribution of each gene to the system’s output.

CR IP T

G. Effect of the discrete to analog conversion algorithm.

M

Our next step was to investigate the correlation between three factors: a) The reduction in the number of the triggered genes b) The number of poles of the inferred discrete system that are near zero. c) The performance of the inference algorithm. Fig. 6 shows the flow diagram that describes the basic steps of the inference process and the points where the above three factors are involved.

Sampling

Inferred discrete system

Discrete to continuous conversion

Poles near zero (factor 2)

AC

Pole plot

CE

`

Discrete artificial data

Pseudoinverse

Inferred analog system

Performance curve

PT

Continuous artificial data

ED

Number of triggered genes selection (factor 1)

Performance evaluation (factor 3)

Fig. 6. Flow diagram showing the basic steps of the inference process. The evaluation of the quality of the information contained in the artificial data is performed by the use of pole plot of the inferred discrete system.

We tested various artificial networks by deleting the interactions that a selected gene offers or receives in a net-

I. Time domain performance. We examined the influence of the number of directly triggered genes on the inference performance in the time domain by using goodness of fitness criteria on the produced time series data. We estimated the goodness of fitness between the data and the reference data. The normalized root mean square error (NRMSE) was selected as a cost function: NRMSE(i) = 1 -

data_ref(:,i) - data(:,i) data_ref(:,i) - mean(data_ref(:,i))

(3)

A NRMSE value equal to 1 corresponds to perfect matching and as values moves towards -∞ the matching decreases. Afterward, we evaluated the average NRMSE over all inputs. This results in a total system time domain performance evaluation index. We generated an artificial network with size equal to 20. We set the network sparseness equal to 50%. We used the successive triggering scenario with a time separation coefficient equal to 100% and a pulse type triggering function. We repeated the inference process 20 times and generated 20 inferred systems. Each time we used a different percentage of directly triggered genes, ranging from 1 to

ACCEPTED MANUSCRIPT

Output (Reference data)

gene 1

gene 20

Original artificial network

gene 20

gene 1

gene 2 gene 3 . . .

Inferred networks

gene 20

2 triggering scenarios

. . .

gene 1 gene 2 gene 3

Compare each reference and inferred output

. . .

Normalized root mean square error (NRMSE)

Average NRMSE over all outputs

gene 20

III.

RESULTS

n=20

A. Network size

M

. . .

gene 1 gene 2 gene 3

Variable number of triggered genes used for inference

ED

gene 2 gene 3

AN US

Input

Fig. 7. Block diagram that describes how the average NRMSE was calcu-

PT

lated.

J. Limitations.

CE

The ordinary differential equations (ODE) used in the creation of the artificial microarray data and the inference algorithm (TSNI) are based on the first order assumption. It is assumed that a gene expression at a certain time point depends on only the gene expression at the immediate previous time point. This is referred as first order or one time unit delay or lag (Mundra et al., 2013). Delay is inserted due to various factors such as transcription and translation among them. Many algorithms allow only one time step. There are also algorithms that allow multiple time steps delay (Lo et al., 2015). The effect of using a first order approximation has not been handled here.

AC

Another point that needs attention is the assumption, used by both Bansal (Bansal et al., 2006) and us, that microarray data follow the concurrent triggering scenario (Fig. 4a). This means that the external perturbation has a concurrent effect on the directly triggered genes through the perturbation matrix B. The effect of possible deviation from this assumption has not been handled here. Finally, no attempt was made to identify nonlinear GRN behavior since both the inference algorithm (TSNI) and the artificial microarray data creation model were linear. Biological systems are known to be inherently nonlinear (Segel, 1980) therefore, alternative methods such as S-Systems and nonlinear dynamical systems (NDS) have been proposed (Yang et al., 2012; Klemm, 2008).

CR IP T

100%. In Fig. 7 it is shown a block diagram that describes how the average NRMSE was calculated. We used two scenarios for the goodness of fitness evaluation. In the first scenario the goodness of fitness was evaluated by triggering the same genes that were used for its inference. This scenario resembles a microarray experiment case since the directly triggered genes are fixed. In the second scenario goodness of fitness was evaluated by triggering all the genes in the network. By this way the network is evaluated in both known and unknown triggering.

The parameters that were used as variables were the network size, the time separation coefficient and the number of directly triggered genes. The area under the P-R curve was used for the performance evaluation. The results are shown in Table 1.

Time separation 0% 100% 100% 100% 100%

Triggered genes 10% 10% 50% 70% 100%

n=10

0.21 0.25 0.35 0.60 1

0.19 0.32 0.42 0.68 1

n=50

0.18 0.21 0.48 0.75 1

n=100

0.17 0.40 0.58 0.73 1

Table 1. Performance of the identification algorithm for various parameter values. The area under the P-R curve is used for performance evaluation. The parameters taken into account are the network size, the percentage of directly triggered genes, and the percentage of the time separation between the triggering pulses of neighbor genes.

It should be noticed that the first line of Table 1 corresponds to the same conditions that Bansal used for his simulations (Bansal et al., 2006). Time separation was equal to 0% and 10% of the genes were directly triggered. We noticed that the inference performance in the case of the zero time separation is rather poor. The transition from the concurrent triggering scenario to the successive triggering scenario and the increase in the number of the directly triggered genes are obviously enhancing the identification

ACCEPTED MANUSCRIPT

B. Percentage of directly triggered genes. We examined the effect of the percentage of directly triggered genes on the inference performance. The network size was set equal to 20. P-R curve n = 20 n = 19 n = 18

Accurancy PPV

n = 15

n = 12 n = 10

AN US

C. Time separation coefficient.

n=5

n=2

Sensitivity Se

(b)

ED PT

n = 20, 19, 18, 15, 12, 10, 5, 2

AC

CE

Accurancy PPV

M

P-R curve

We examined the effect of the degree of time separation between the triggering pulses on the inference performance. The network size was set equal to 20. We used a fixed percentage of directly triggered genes equal to 100%. We varied the time separation coefficient in a range from 200% to 0%. The network sparseness is equal to 50%.This range allows us to compare the concurrent and the time separation scenario. A pulse type triggering scenario was used. The PR curve is shown in Fig 9. P-R curve 80% 3% 50% 25%

2%

Accurancy PPV

(a)

We varied the number of directly triggered genes from 2 to 20. The network sparseness was equal to 50%. A pulse type triggering scenario was used. Simulations were carried out for a pulse type triggering scenario with a time separation coefficient equal to 200% (Fig. 8a) and a time separation coefficient equal to 0% (Fig. 8b) It is noticed that when we use all the available genes in the time separation triggering scenario (20 genes) we have a perfect recovery. On the other hand, the algorithm performance drops when we decrease the number of genes that are directly triggered. It is also clear that in order to see the influence of the percentage of triggered genes on the inference performance, a time separation coefficient near 100% is needed. A final point that is worth mentioning is the fact that when a pulse type stimulation is used, the 100% of the available genes are required to be directly triggered in order to achieve a 100% performance.

CR IP T

performance. The last line of Table 1 gives the parameters that lead to perfect identification.

5% 10%

1%

Time separation coeffiient = 0%

Sensitivity Se

2

Sensitivity Se

Fig. 8. P-R curve for (a) the successive triggering scenario and for (b) the concurrent triggering scenario. In both cases the networks size was set equal to 20 and the number of genes that are directly triggered varies from 2 to 20. The network sparseness is equal to 50%. A pulse type triggering scenario was used.

Fig. 9. P-R curves for the time separation triggering scenario. The network size was set equal to 20. The time separation coefficient (T) varies from 0% to 200%. Percentage of directly triggered genes was equal to 100%.

ACCEPTED MANUSCRIPT

P-R curve (a) n = 15% n = 12%

n = 18%

n = 20% n = 19%

n = 10%

Accurancy PPV

n = 5%

n = 20%

ED

n = 19%

n = 15%

PT

Accurancy PPV

n = 12%

n = 10%

CE

n = 5% n = 2%

AC

We also checked how the inference performance was affected when two parameters were altered simultaneously. The network sparseness and the number of directly triggered genes were selected as variables, since previous simulations provided similar results for various network sizes. The area under the P-R curve was used as a measure of the inference performance and corresponding three dimensional (3D) plots were extracted. The network size was set equal to 20, while the pulse function triggering was employed. Initially, the time separation was set equal to 100%. The combined inference performance is shown in Fig. 11a. Decrease of sparseness or increase of the number of directly triggered genes leads to an increase of the inference performance, which can reach a maximum value of 1. The sparseness significantly suppresses the dynamic range of the effect of the number of directly triggered genes on the inference performance from [0.2,1] to [0.7,1]. Then, the time separation was set equal to 0%. Again, decrease of sparseness or increase of the number of directly triggered genes leads to an increase of the inference performance. In this case, however, the maximum performance is around 0.65. As clearly shown in Fig. 11b, in this case the sparseness suppresses to a lesser degree the dynamic range of the inference performance, as a result of the variation in the number of directly triggered genes, compared to the previous scenario of 100% time separation.

AN US

P-R curve

M

Sensitivity Se

n = 18%

The next parameter that was examined was the network sparseness. The network size was set equal to 20 genes. We varied the number of directly triggered genes from 2 to 20. Simulations were carried out for a pulse type triggering scenario with a time separation coefficient equal to 200%. Simulations were carried out for two values of sparseness, 100% and 50%. The P-R curves are shown in Fig. 10. Comparing Fig. 10a to Fig, 10b, it is obvious that increasing of the network sparseness, while keeping the percentage of triggered genes constant, leads to a performance degradation. It is obvious that high degree of gene interconnection adds information on the inference data, providing “forgiveness” to the lack of directly triggered genes.

E. Varying two parameters simultaneously.

n = 2%

(b)

D. Network Sparseness.

CR IP T

It is noticed that by simply diminishing the time separation, while maintaining no triggering overlap (time separation coefficient  [100%, 200%]), the inference performance is not affected. We also noticed that a limited triggering overlap is well handled without significant performance degradation. Α time separation coefficient less than 80% leads to noticeable performance degradation and as the overlap extends the performance gets poorer. It is obvious that time separation plays a crucial role in the performance of the inference algorithm.

Sensitivity Se

Fig. 10. P-R curves for the time separation triggering scenario with a time separation coefficient equal to 200%. The number of genes that are triggered varies from 2 to 20. The network size is 20. The pulse function triggering was used (a) in a fully to a fully connected network with sparseness equal to 0%, and, (b) in a sparse to a sparse network with sparseness equal to 50%.

F. Additive noise on the pulse stimulation The next parameter that was examined was the influence on the performance due to the presence of additive noise on the pulse stimulation. The network size was set to 20 genes. We set the number of directly triggered genes equal to 100%. Simulations

ACCEPTED MANUSCRIPT

were carried out using the pulse type triggering scenario with a time separation coefficient equal to 200%. The network sparseness is equal to 50%. Simulations were carried with noise levels varying from 0% to 10% sparseness set to 0% and 50%. The corresponding P-R curves are shown in Fig. 12. By comparing Fig. 12a to Fig. 12b it is obvious that sparse networks are more vulnerable to noise compared to fully connected networks.

G. Noise as triggering function The Gaussian noise type triggering function, shown in Fig. 5, was used replacing the step function. The network size was set to 20 genes. We varied the number of directly triggered genes from 2 to 20. The time separation coefficient was equal to 200%. The network sparseness was equal to 0%. The corresponding P-R curves are shown in Fig. 13. P-R curve (a)

(a)

5%

0% 1% 2%

0%

Sp

20%

ars

40%

ne ss

60% 80%

5

2

15

10

nes tly triggered ge Number of direc

20

(b)

0% 20% 40%

rsn ess

60%

10

80%

1

15

AC

Sensitivity Se P-R curve 0% 1% 2%

Accurancy PPV

5%

10%

20

5 genes iggered irectly tr d f o r e Numb

CE

Spa

PT

ED

AUPR curve

M

(b)

CR IP T

AN US

AUPR curve

Accurancy PPV

10%

Fig. 11. Area under the P-R curve (AUPR). The network size is 20. The pulse function triggering was used. (a) Time separation was set equal to 100%. The maximum AUPR is equal to 1. Decrease of sparseness or increase of the number of directly triggered genes leads to an increase of inference performance (b) Time separation was set equal to 0%. The maximum AUPR is around 0.65. Decrease of sparseness or increase of the number of directly triggered genes leads to an increase of inference performance.

Sensitivity Se

Fig. 12. P-R curves for various levels of noise (%) for (a) a fully connected network, and, (b) a sparse network consisting of 20 genes.

ACCEPTED MANUSCRIPT

Effect of the discrete to analog conversion algorithm.

Two things have been noticed. The first thing is that the Tustin’s method accuracy is inferior to the other two methods. This has been tested in various scenarios using different number of triggered genes (10% - 100%) and different sizes of GRNs (n=10, 20, 50, 100). A representative example of the obtained results is shown in Fig. 14 where a 20 gene fully connected GRN is used. 10 genes are triggered with a pulse triggering function. The time separation coefficient was equal to 200%. P-R PR curve curve 1

I. Inferred system poles position. n = 20, 19,18,15,12

It is noticed that as the number of the triggered genes that are deleted increases, the poles of the system that are equal

0.9

n = 10

0.8

The second thing is that as we diminish the number of the triggered genes there is an increase in the number of poles of the discrete inferred system that are near zero in the z-plan. The zoh and foh methods are known to be unable to cope with systems with poles near zero. In contrast the Tustin’s method can handle such cases. In our simulation we mainly used the zoh method. In cases where the system poles were close to zero leading to error messages for the zoh algorithm, the Tustin’s method was selected.

n=5 n=2

Number of near zero poles

0.5 0.4 0.3 0.2 0.1 0

0

0.2

0.4 0.6 False Sensitivity Positive Rate Se FPR

0.8

1

AN US

Accurancy Se Sensitivity PPV

0.7 0.6

CR IP T

H.

M

Fig. 13. P-R curve for the Gaussian noise triggering function type scenario.

Time separation coefficient was set to 200%. The number of genes that are triggered varies from 2 to 20. The gene network is fully connected.

Number of deleted genes

0.8

Tustin algorithm

PT

0.7

Number of near zero poles

Zoh and foh algorithms

0.9

0.6 0.5

CE

Se Sensitivity Accuracy PPV

ED

PR P-R curve curve 1

0.4 0.3

AC

0.2 0.1 0

0

0.2

0.4 0.6 False Positive Rate Sensitivity Se FPR

0.8

1

Fig. 14. . P-R curves for a fully connected network consisting of 20 genes using a number of triggered genes = 10 and a pulse shaped triggering function. The time separation coefficient was equal to 200%.

Number of not triggered genes

Fig. 15. (a) Relationship between the number of genes that are deleted and the number of the system poles that are equal to zero. (b) Relationship between the number of genes that are not triggered and the number of the system poles that are equal to zero. In both experiments, a network size equal to 20 and sparseness equal to 50% was used.

ACCEPTED MANUSCRIPT

triggered genes. The dataset that is created corresponds to discrete systems with poles near the origins. The inference performance is decreased.

Imaginary Axis

CR IP T

(a)

AC

CE

PT

Real Axis

(b)

Imaginary Axis

ED

M

AN US

to zero increase in a linear way. An example of a network with size equal to 20 and sparseness equal to 50% is shown in Fig. 15a. Using the same network we used a time separation triggering scenario with a time separation coefficient value equal to a 200%, a number of directly triggered genes that varied from 100% to 0% and we examined the pole behavior. The results are shown in Fig. 15b. As it is shown in Fig. 15b, decreasing the number of directly triggered genes in a sparse network has almost the same effect with increasing the number of deleted genes. Deleting genes is equal to proportionally decreasing the system’s order. Decreasing the number of directly triggered genes leads to datasets that correspond to systems of order lower than the real one. It should be noted that, in the mathematical formalization that has been used here, “system order” refers to the number of genes actively participating in the GRN. It’s obvious that it is impossible for a discrete to analog conversion algorithm to produce an accurate system since the data that it is working on are corresponding to systems of lower order. A snapshot of the pole plot during the decrease of the number of triggered genes is shown in Fig. 16. Fig. 16a corresponds to the case where the percentage of directly triggered genes is 100%. The inference performance is 100% (not shown). Fig. 16b corresponds to the case where the percentage of triggered genes is 40%. The inference performance has decreased (not shown). It’s evident that a decrease in the percentage of directly triggered genes leads to poles gathered near zero and inference performance degradation. We also tested various artificial networks by gradually deleting the interactions that a selected gene produces. For the ith gene, this is done by multiplying the elements of line (i) and column (i) in the connectivity matrix of the discrete system with a coefficient that takes values from 0 to 1. It was noticed that as the multiplying coefficient decreases, the pole that corresponds to the selected gene, is forced to move towards the origins. An example of a network with size equal to 10 and sparseness equal to 50% is shown in Fig 17. All the above pole placement study clarifies what happens in two cases: 1. Time separation triggering scenario with a time separation coefficient (>100%) and a percentage of triggered genes equal to 100%. The dataset that is created corresponds to discrete systems with all the poles away from the origins. The inference performance is excellent. 2. Time separation triggering scenario with a time separation coefficient (>100%) and a diminishing percentage of

Real Axis

Fig. 16. (a) Pole plot of the discrete system corresponding to a percentage of triggered genes equal to 100%, time separation equal to 200% and 100% inference performance. A 20 gene fully connected network was used. (b) Pole plot of the discrete system corresponding to a percentage of triggered genes equal to 40%, time separation equal to 200% and decreased inference performance. The same 20 gene fully connected network was used in both experiments.

The decrease in the percentage of triggered genes leads to datasets that contain information of poor quality corresponding to systems of lower order than the original one. This imposes great difficulties on the inference algorithm

ACCEPTED MANUSCRIPT

Performance calculated over trained outputs Performance calculated over all outputs

CR IP T

Imaginary Axis

pole movement

Final pole position

P-R curve

(a)

Normalized root mean square error Averaged over all outputs

which is forced to generate a system from low quality data, leading to poor performance.

Initial pole position

Number of triggered genes used in the inference process (b)

AN US

Real Axis

M

J. Time domain performance.

AC

CE

PT

ED

The performance curve for the first scenario, where the goodness of fitness (Eq. 3) was evaluated by triggering the same genes that were used for its inference, is shown in Fig. 18a. The curve is an almost straight line tagged as ‘performance calculated over trained outputs’. A ‘zoom in’ version of the curve is shown in Fig. 18b. The performance curve for the second scenario, where the goodness of fitness was evaluated by triggering all the genes, is also shown in Fig. 18a. The curve is tagged as ‘performance calculated over all outputs’ An important notice must be made. All inferred networks had excellent time response matching when they were triggered with the same triggering data that they were inferred. This means that when we don’t know the structure of a network and an inference process gives excellent time domain matching this doesn’t prove that the system has been accurately identified. Different systems are able to give the same output in the time domain under the same triggering. This is the case with microarray data. Different systems can produce the same gene expression levels under the same external perturbation.

Performance calculated over trained outputs

Normalized root mean square error Averaged over all outputs

Fig. 17. Pole plot of a 10 gene network. |The coefficients of the discrete connectivity matrix regarding one specific gene are gradually diminished. The initial poles are marked in black. The corresponding pole movement is marked with red.

P-R curve

Number of triggered genes used in the inference process

Fig. 18. (a) Average NRMSE over all inputs and average NRMSE over trained inputs (b) ‘Zoom in’ version of the average NRMSE over all inputs curve.

IV.

E. COLI SOS DNA REPAIR NETWORK

To test the different triggering scenarios in a real GRN, one should ideally employ a real network which fulfills the two following requirements: 1) It has already been identified, regarding both structure and network parameters. This will enable its use as a golden standard and will facilitate comparisons with inferred models. 2) Each gene has been triggered separately and the network response has been measured.

ACCEPTED MANUSCRIPT

Choe, 2008b). The model parameters and the corresponding connectivity matrix are shown in Fig. 20.

1 -0.17 -0.009 -0.037 -0.007 -0.037 -0.008 -0.009

CR IP T

x1(t+1) = x1(t) + x2(t); x2(t+1) = -0.17 x1(t) + 0.59 x2(t) + 0.084 u(t); x3(t+1) = 0.009 x1(t) + 0.81 x3(t); x4(t+1) = 0.037 x1(t) + 0.74 x4(t); x5(t+1) = -0.007 x1(t) + 0.664 x5(t); x6(t+1) = -0.037 x1(t) + 0.965 x6(t); x7(t+1) = 0.008 x1(t) + 0.697 x7(t); x8(t+1) = 0.009 x1(t) + 0.621 x8(t); 1 0.59 0 0 0 0 0 0

0 0 0.81 0 0 0 0 0

Fig. 20. The parameters and the corresponding connectivity matrix of a model corresponding to the SOS DNA repair network in bacterium Escherichia coli. (Xiong and Choe, 2008a).

In the SOS network, x1 is the expression level of gene lexA, x2 is the discretized 1st derivative of x1, x3 gene polB, x4 gene umuD, x5 gene uvrD, x6 gene uvrA, x7 gene uvrY, and x8 gene ruvA. The system’s sole input is gene recA. The diagram of the SOS network is shown in figure 21. The first line in the connectivity matrix (Fig. 20) is used to insert second order dynamics in the model in order to describe the time series more adequately. recA

PT

ED

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.74 0 0 0 0 0 0.964 0 0 0 0 0 0.965 0 0 0 0 0 0.697 0 0 0 0 0 0.621

lexA

CE

AC

Gene Expression

M

AN US

Since both requirements cannot be easily fulfilled, we chose to get as near as possible to an ideal GRN by employing the SOS DNA repair network in bacterium Escherichia coli. The SOS network is a well-studied network and the fact that a simplified model of it, based on real data, already exists is helpful. The real dataset contains only a single gene triggering scenario (gene recA directly triggers gene lexA). Thus, the different triggering scenarios and the corresponding datasets must inevitably be created by simulation. The SOS network constitutes a benchmark in the GRN inference problem using time datasets. Uri Alon and his group (Ronen et al., 2002) conducted experiments where they irradiated cultures with UV light in order to create DNA damage. They performed 4 experiments with two different light intensities (Experiments 1 and 2: UV 5Jm-2, Experiments 3 and 4: UV 20 Jm-2). The expression levels of eight major genes (uvrD, lexA, umuD, recA, uvrA, uvrY, ruvA, and polB) were measured using a sampling rate equal to 1 measurement per 6 minutes and a final dataset consisting of 50 time points per gene in each experiment was built. Fig. 19 shows the expression levels of the eight SOS genes from experiment 1. LexA acts as a DNA damage sensor. When no damage is sensed, LexA represses the genes that are involved in the DNA repair process. When DNA damage is sensed, protein lexA binds to damaged single strand DNA, it becomes activated and leads to auto cleavage of LexA. Cleavage of LexA leads to upregulation of the SOS genes. After the repair of DNA the recA levels drop and lexA represses the SOS genes.

Time (min)

polB

umuD

uvrD

uvrA

uvrY

ruvA

Fig. 19. Measured gene expressions of the eight SOS genes in the E. coli SOS DNA repair network. Data from experiment 1 using an UV dose equal to 5Jm-2 is shown (Ronen et al., 2002).

Fig. 21. Diagram of the E. coli SOS DNA repair network. Under normal A state space model based on the Alon’s dataset was constructed by Xiong (Xiong and Choe, 2008a, Xiong and

conditions gene lexA represses the rest of the SOS network genes. Expression of gene recA acts as a sole network input.

ACCEPTED MANUSCRIPT

This is shown in Fig 23 where the corresponding P-R curve is equal to one and it is also shown in Fig. 24 where the inferred connectivity matrix is equal to the original matrix. 1 -0.17 -0.009 -0.037 -0.007 -0.037 -0.008 -0.009

1 0.59 0 0 0 0 0 0

0 0 0.81 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0.74 0 0 0 0 0.964 0 0 0 0 0.965 0 0 0 0 0.697 0 0 0 0

CR IP T

The pole plot of the E. coli SOS DNA repair network is shown in Fig. 22.

0 0 0 0 0 0 0 0.621

Fig. 24. Connectivity matrix corresponding to the inferred system. All 7 genes were triggered using time separation and 100% identification performance was achieved.

Fig. 22. Pole plot of the E. coli SOS DNA repair network. (7-gene model)

PR curve

n=6

PT n=4

T > 120%

T = 110% T = 105%

T = 20% T = 100%

T = 60%

T = 0%

n=3

CE

Accurancy PPV

n=5

ED

n=7

P-R curve

Accurancy PPV

M

AN US

We used the 7-gene model as a reference model and applied different triggering scenarios. Firstly, only one gene was triggered (lexA) using a 200% time separation coefficient. Afterwards we increased the number of directly triggered genes. The estimated P-R curves are shown in Fig. 23. It must be noted that the experimental dataset corresponds to a single gene triggering scenario in which gene lexA is directly triggered by external stimulation (gene recA). It is clear that inference performance increases when additional genes are triggered using time separation. When all genes are triggered, 100% identification can be accomplished.

We also varied the time separation coefficient from 0% to 200% while triggering all 7 genes. The relevant P-R curves are shown in Fig 25. As seen, when the time separation is higher than 120% the inference performance is maximum. However, a time separation coefficient less than 120% leads to significant performance degradation. It is obvious, therefore, that time separation plays a crucial role in the performance of the inference algorithm.

n=2

AC

n=1

Sensitivity Se

Fig. 25. P-R curves for the E. coli SOS DNA repair network. All 7 genes were triggered using time separation. When time separation is higher than 120% the inference performance is perfect. Α time separation coefficient (T) less than 120% leads to noticeable performance degradation.

Sensitivity Se

Fig. 23. P-R curves for the E. coli SOS DNA repair network. The curve with only one triggered gene (recA) corresponds to inference based on only the experimental data set. When additional genes are triggered using 200% time separation the inference performance increases. When all genes are triggered using 100% time separation 100% inference performance is accomplished.

Resultant pole plots are shown in Fig. 26. When all 7 genes were triggered and a time separation coefficient equal to 200% was used, a perfect identification was achieved. Fig. 26a shows the corresponding pole plot which is identical with the pole plot of the original system (Fig. 22). When only one gene was triggered, there was poor inference per-

ACCEPTED MANUSCRIPT

formance and 3 poles moved around zero as shown in Fig 26b. This denotes loss of valuable information in the dataset.

When all 7 genes were triggered but with no time separation, there was poor inference performance and 3 poles moved around zero as shown in Fig 26c. This practically denotes that information regarding three genes was lost.

(a)

V. INFERENCE USING DIFFERENT INITIAL CONDITIONS

CR IP T

Time series data used for GRN inference can be created by either applying an external stimulation on a GRN or by using a set of different initial conditions that creates the gene expression time series (Wang et al., 2006). Initial conditions can be imposed by two ways: 1. A number of different levels of external stimulation can be applied to the GRN. This way the different initial conditions affect the network through the set of directly genes. 2. Initial conditions are randomly chosen for each gene in the network. This way, the different initial conditions directly affect all the network genes without taking into account which genes are directly triggered. In this case, time series are created by imposing random initial conditions to all the network genes and the network response is simulated until a steady state is reached. In the current experiments, the network size was set to 20 genes. The proportion of directly triggered genes was equal to 10%. Simulations were carried out using the pulse type triggering scenario with a time separation coefficient equal to 0%. The network sparseness was set equal to 50%. For the first case, 100 different levels of external stimulation were used. For the second case, simulations were carried using a variable number of datasets, ranging from 1 to 300, each set corresponding to different initial conditions.

AC

CE

PT

(c)

Fig. 26. (a) Pole plot of the inferred SOS DNA repair network. All genes were triggered using time separation. Inferred system is identical to the original one. (b) Pole plot of the inferred SOS DNA repair network. Only one gene was triggered (lexA). Three poles are equal to zero denoting the loss of information in the dataset. (c) Pole plot of the inferred SOS DNA repair network. All genes were triggered using no time separation. Again, three poles are equal to zero denoting the loss of information in the dataset.

P-R curve

Accurancy PPV

ED

M

AN US

(b)

(b) (a)

Sensitivity Se

Fig.27. (a) P-R curve using a number of different levels of external stimulation. The network size was set equal to 20. The time separation coefficient was equal to 0%. Percentage of directly triggered genes was equal to 10%. The different levels of external stimulation were set to 100. (b) Pulse triggering scenario using the same network.

ACCEPTED MANUSCRIPT

n > 200 n = 12

n = 20

Accurancy PPV

n = 11

n = 10 n=5 n=1

Sensitivity Se

AN US

Fig. 28. P-R curves using random initial conditions. The network size was set equal to 20. The time separation coefficient was equal to 0%. Percentage of directly triggered genes was equal to 10%. The number of datasets with different initial conditions (n) was varied from 1 to 300.

mance of the algorithm. In addition, by increasing the number of genes that are directly triggered, the inference performance is also improved. Also, sparse networks are shown to be highly affected by noisy triggering compared to fully connected networks. The Gaussian noise triggering leads to better performance in contrast to the pulse function triggering. Finally, the network sparseness imposed difficulties on the inference algorithm both in noiseless and noisy environments. When the same simulation parameters with Bansal (Bansal et al., 2006) were used, the results were very close. Altering two factors at the same time allowed us to create 3D plots of the inference performance. As shown in Fig. 11, decrease of sparseness or increase of the number of directly triggered genes leads to an increase of the inference performance. However, the sparseness may significantly suppress the dynamic range of the effect of the number of directly triggered genes on the inference performance highlighting the interrelationship between the various contributing factors. Different initial conditions were used as an alternative approach. When different external conditions were employed, forcing the triggering information to pass through the directly triggered genes, the inference performance was poor. When random initial conditions were chosen, bypassing the directly triggered genes, the inference performance was improved but was not maximzsed even with an increased number of datasets. The effect of a triggering scenario that uses time separation and a variable number of directly triggered genes was applied on a real GRN, the SOS DNA repair network in bacterium Escherichia coli. The results were found to be in good agreement with the results produced by artificial networks, substantiating the initial hypothesis of this study that artificial microarray data can adequately support the inference process. It should be noted, however, that in a real life microarray experiment an external perturbation directly triggers only a small portion of the total number of network genes. The rest of the genes are indirectly stimulated through the connectivity matrix A. Another fact is that the directly triggered genes do not follow the time-separated multi-gene scenario. Also GRNs are usually sparse. Finally, the external perturbation cannot follow a Gaussian pattern due to cell response time limitations. We come to the conclusion that microarray experiments provide data with insufficient amount of information for accurate network identification. This information gap must be bridged by alternative input–output identification experiments or alternative data sources. In this sense, findings of this study may provide experimentalists with a different methodological approach which could lead to an increased

CR IP T

P-R curve

CONCLUSIONS

PT

VI.

ED

M

The inference performance for the two cases is shown in Figures 27 and 28 respectively. As noticed in Fig. 27, the different levels of external stimulation (a) do not favour the inference process and a slightly better accuracy is achieved even by the basic pulse triggering scenario examined above (b). When random initial conditions are used, it is seen that increasing the number of datasets greatly improves the inference performance but a saturation effect is observed when the number of datasets gets higher than 200 (Fig 28).

AC

CE

We examined the adequacy of artificial microarray data to support a specific GRN inference algorithm. We used a time separation triggering scenario and focused on five key points. 1. The effect of the percentage of genes that are directly triggered. 2. The effect of time separation between individual gene triggering pulses. 3. The effect of different noise levels on both sparse and fully connected networks. 4. The effect of the use of Gaussian noise as triggering in the multi-gene triggering scenario. 5. The effect of network sparseness on the inference performance. Results obtained by extensive simulations show that separation of gene triggering pulses improves the perfor-

ACCEPTED MANUSCRIPT

N

a

ij

j=1

N

x j (tk ) +

b

ij

a

discrete

ij

N

x j (tk ) +

j=1

b

discrete

ij

uj (tk )

(A3)

j=1

The discrete state space model is 

X(tk + 1) = A d X(tk ) + Bd U(tk )

 X(t ) X (tk + 1 ) = [A d Bd ] *  k  U(tk )

AN US

A GRN can be described by a system of ordinary differential equations (Bansal et al., 2006): 

N

xi (tk + 1 ) =

k = 1 ... M

(A4)

It should be not noted that the continuous and the discrete state space model are different models. We can rewrite Eq. (A4) as:

APPENDIX. THE TSNI ALGORITHM

xi (tk ) =

sampled data and no transformation is needed when used in a discrete state space model. The system of ordinary differential equations is transformed into a system of ordinary difference equations as shown in Eq.(A3)

CR IP T

number of datasets encompassed in the inference process and, thus, to a more accurate network identification outcome. The simplicity of the state space model used for the GRN representation is appealing although it may suffer certain limitations. As a future work it would be interesting to evaluate the inference performance in two cases: 1) when units of time delay are inserted in the artificial microarray data and 2) when nonlinear behavior in the gene interconnections is used in the creation of the artificial microarray data. The integration of prior biological knowledge (e.g. known gene interactions) in the inference process constitutes another interesting aspect to be studied.

uj (tk )

(A1)

j=1

k = 1 ... M

(A5)

Which can be written as: X=H*Y

(A6)

where xi is the concentration of transcript i measured at 

where

A nonzero value denotes that the corresponding gene is a directly triggered gene and a zero value denotes that it is not a directly triggered gene. Coefficient uj (tk ) represents the jth

The system in Eq.(A6) is solved using the pseudo-inverse matrix of Y. The discrete system can be converted back to its continuous form by the use of various techniques such as the bilinear transformation (Tustin’s method), the zero order hold (zoh method) and the first order method. The algorithm can be downloaded, written in Matlab code, from the di Bernardo Lab website (http://dibernardo.tigem.it).

ED

M

time tk, x i is the rate of change or the first derivative of transcript i at time tk, aij is the influence of gene j on gene i. A positive, negative or zero value of aij respectively corresponds to activation, repression or no influence. Coefficient bij represents the influence of the jth perturbation on gene i.



k = 1 ... M

CE

X(tk ) = A X(tk ) + B U(tk )

PT

perturbation at time tk. N is the number of network genes. We can use the state space formalism and rewrite Eq. (A1) as a continuous state space model: (A2) 

AC

where X(tk) is the concentration vector, X(tk ) is the vector that contains the first derivative of all transcripts at time tk, A is the connectivity matrix that contains all aij coefficients, B is the perturbation matrix that contains all bij coefficients. U is the perturbation vector that contains perturbations applied to every gene at time tk . M is the number of sample points. The continuous state space model is transformed in a discrete state space model for 2 reasons. Firstly, we avoid estimating time derivatives of gene expressions that increase the noise level. Secondly gene expression profiles are

 X H  [A d Bd ] and Y    U 

ACCEPTED MANUSCRIPT

Bansal M, Gatta G, di Bernardo D (2006) Inference of gene regulatory networks and compound mode of action from time course gene expression profiles. Bioinformatics 22:815-822. doi: 10.1093/bioinformatics/btl003. Bar-Joseph Z (2004) Analyzing time series gene expression data. Bioinformatics 20:2493-2503. doi: 10.1093/bioinformatics/ bth283. Bosch, P. and Klauw, A. (1994). Modeling, identification, and simulation of dynamical systems. Boca Raton: CRC Press. Castilla Maria del Mar, 2014. Comfort Control in Buildings (Advances in Industrial Control). 2014 Edition. Springer.

Rajamani Doraiswami, 2014. Identification of Physical Systems: Applications to Condition Monitoring, Fault Diagnosis, Soft Sensor and Controller Design. 1 Edition. Wiley. Ristevski B (2013) A survey of models for inference of gene regulatory networks, Nonlinear Anal. Modell. Control 18 (4). Ronen, M., Rosenberg, R., Shraiman, B. and Alon, U. (2002). Assigning numbers to the arrows: Parameterizing a gene regulation network by using accurate expression kinetics. Proceedings of the National Academy of Sciences, 99(16), pp.10555-10560. Sauro H (2011) Enzyme kinetics for systems biology. Ambrosius Publishing, Lexington, KY.

CR IP T

REFERENCES

Segal L, Lapidot M, Solan Z et al. (2007) Nucleotide variation of regulatory motifs may lead to distinct expression patterns. Bioinformatics 23:i440i449. doi: 10.1093/bioinformatics/ btm183.

Das, S. (2010). Handbook of research on computational methodologies in gene regulatory networks. Hershey, PA: Medical Information Science Reference.

Segel, L. (1980). Mathematical models in molecular and cellular biology. Cambridge [England]: Cambridge University Press.

Fawcett T (2006) An introduction to ROC analysis. Pattern Recognition Letters 27:861-874. doi: 10.1016/j.patrec.2005.10. 010.

Storey, J., Xiao, W., Leek, J., Tompkins, R. and Davis, R. (2005). Significance analysis of time course microarray experiments. Proceedings of the National Academy of Sciences, 102(36), pp.12837-12842.

AN US

Chen L, Wang R, Zhang X (2009) Biomolecular networks. Wiley, Hoboken, N.J.

Franklin, G. and Powell, J. (1980). Digital control of dynamic systems. Reading, Mass.: Addison-Wesley Pub. Co. Hecker M, Lambeck S, Toepfer S et al. (2009) Gene regulatory network inference: Data integration in dynamic models—A review. Biosystems 96:86-103. doi: 10.1016/j.biosystems.2008. 12.004.

M

Hitoshi Iba, 2016. Evolutionary Computation in Gene Regulatory Network Research (Wiley Series in Bioinformatics). 1 Edition. Wiley. Ivanov, I., Qian, X. and Pal, R. (n.d.). Emerging research in the analysis and modeling of gene regulatory networks.

ED

Katayama, T. (2005). Subspace methods for system identification. London: Springer

PT

Klemm S.L. Master's thesis. UK: Department of Engineering, University of Cambridge; 2008. Causal Structure Identification in Nonlinear Dynamical Systems. Ljung, L. (1999). System identification. Upper Saddle River, NJ: Prentice Hall PTR.

CE

Lo, L., Wong, M., Lee, K. and Leung, K. (2015). High-order dynamic Bayesian Network learning with hidden common causes for causal gene regulatory network. BMC Bioinformatics, 16(1).

AC

Mundra, P. A., Zheng, J., Niranjan, M., Welsch, R. E., & Rajapakse, J. C. (2013). Inferring Time-Delayed Gene Regulatory Networks Using CrossCorrelation and Sparse Regression. Bioinformatics Research and Applications Lecture Notes in Computer Science, 64–75. Nelles, O. (2001). Nonlinear system identification. Berlin: Springer. Penfold C, Wild D (2011) How to infer gene networks from expression profiles, revisited. Interface Focus 1:857-870. doi: 10.1098/rsfs.2011.0053.

Wang, Y., Joshi, T., Zhang, X., Xu, D. and Chen, L. (2006). Inferring gene regulatory networks from multiple microarray datasets. Bioinformatics, 22(19), pp.2413-2420. Xiong, H. and Choe, Y. (2008a). Structural systems identification of genetic regulatory networks. Bioinformatics, 24(4), pp.553-560.

Xiong, H. and Choe, Y. (2008b). Dynamical pathway analysis. BMC Systems Biology, 2(1), p.9. Yang, X., Dent, J. and Nardini, C. (2012). An S-System Parameter Estimation Method (SPEM) for Biological Networks. Journal of Computational Biology, 19(2), pp.175-187 Yang, W. (2009). Signals and systems with MATLAB. Berlin: SpringerVerlag. Zhu, Y. (2001). Multivariable system identification for process control. Amsterdam: Pergamon.

Corresponding author: Author: Institute: Street: City: Country: Email:

P. Xenitidis School of Medicine, Democritus University of Thrace University Campus, Dragana Alexandroupolis Greece [email protected]