Dynamic Transcriptomic Data Reveal Unexpected Regulatory Behavior of Scheffersomyces stipitis

Dynamic Transcriptomic Data Reveal Unexpected Regulatory Behavior of Scheffersomyces stipitis

12th IFAC Symposium on Dynamics and Control of Process including Biosystems 12th IFACSystems, Symposium on Dynamics and Control of 12th IFAC Symposium...

1MB Sizes 0 Downloads 22 Views

12th IFAC Symposium on Dynamics and Control of Process including Biosystems 12th IFACSystems, Symposium on Dynamics and Control of 12th IFAC Symposium on Control of Available at www.sciencedirect.com 12th IFACSystems, Symposium on Dynamics Dynamics and Controlonline of Florianópolis - SC,including Brazil, April 23-26,and 2019 Process Biosystems Process including Biosystems 12th IFACSystems, Symposium on Dynamics and Control of Process Systems, including Biosystems Florianópolis SC, Brazil, April 23-26, 2019 Florianópolis -- SC, Brazil, 23-26, Process Systems, Biosystems Florianópolis SC,including Brazil, April April 23-26, 2019 2019 Florianópolis - SC, Brazil, April 23-26, 2019

ScienceDirect

IFAC PapersOnLine 52-1 (2019) 538–543

Dynamic Transcriptomic Data Reveal Unexpected Regulatory Behavior of Dynamic Transcriptomic Scheffersomyces Data Reveal Unexpected Behavior of stipitis Regulatory Dynamic Regulatory Dynamic Transcriptomic Transcriptomic Data Data Reveal Reveal Unexpected Unexpected Regulatory Behavior Behavior of of Dynamic Transcriptomic Scheffersomyces Data Reveal Unexpected Regulatory Behavior of stipitis Scheffersomyces stipitis Scheffersomyces stipitis * Scheffersomyces stipitis Matthew Hilliard*. Q. Peter He . Jin Wang* * Matthew Hilliard****. Q. Peter He****. Jin Wang* * *

Matthew Hilliard .. Q. He Wang Matthew Hilliard Auburn Q. Peter Peter He*.. Jin Jin WangAL 36849 USA *Department of Chemical Engineering, University, Auburn, *. Jin Wang* * Matthew Hilliard**. Q. Peter He  *Department of Chemical Engineering, Auburn University, Auburn, AL 36849 USA (MH:*Department [email protected]; QPH: [email protected]; JW: [email protected], 334-844-2020) of Chemical Engineering, Auburn University, Auburn, AL *Department of Chemical Engineering, Auburn University, Auburn, AL 36849 36849 USA USA  (MH: [email protected]; [email protected]; JW: [email protected], 334-844-2020) *Department of ChemicalQPH: Engineering, Auburn University, Auburn, AL 36849 USA (MH: [email protected]; QPH: [email protected]; JW: [email protected], 334-844-2020) (MH: [email protected]; QPH: [email protected]; JW: [email protected], 334-844-2020) Abstract: The rapid developments in “omics” technologies offer unprecedented opportunities to help (MH: [email protected]; QPH: [email protected]; JW: [email protected], 334-844-2020) understand cellular metabolisms at genome-scale. Among different “omics” data, RNA-seq based

Abstract: The rapid developments in “omics” technologies offer unprecedented opportunities to help Abstract: The rapid developments in “omics” technologies offer opportunities to help Abstract: The rapid metabolisms developments in to “omics” technologies offer unprecedented unprecedented opportunities toof help transcriptome profiling has been used enhance the understanding of the genome-scale response the understand cellular based at genome-scale. Among different “omics” data, RNA-seq Abstract: The rapid developments in “omics” technologies offer unprecedented opportunities to help understand cellular metabolisms at genome-scale. Among different “omics” data, RNA-seq based understand cellular metabolisms at genome-scale. Among different “omics” data, RNA-seq based organism to different stimuli. While these studies have provided insightful findings, they are most often transcriptome profiling has been used to enhance the understanding of the genome-scale response of the understand cellular metabolisms at genome-scale. Among different “omics” data, RNA-seq based transcriptome profiling profiling has has been been used used to to enhance enhance the the understanding understanding of of the the genome-scale genome-scale response response of of the the transcriptome limited to to studying two-steady-state conditions. From aprovided control perspective, as cellular metabolism isthea organism different stimuli. While studies have insightful findings, they are most often transcriptome profiling has been usedthese to enhance the understanding of the genome-scale response of organism to different stimuli. While these studies have provided insightful findings, they are most often organism to different stimuli. While these studies have provided insightful findings, they are most often highly complex dynamic system, the conditions. transient response offer significantly more information theaa limited to studying two-steady-state From control perspective, as cellular metabolism is organism to different stimuli. While these studies haveaaacould provided insightful findings, they are moston often limited to studying two-steady-state conditions. From control perspective, as cellular metabolism is limited to studying two-steady-state conditions. From control perspective, asthis cellular metabolism is a cellular metabolism, particularly on potential gene regulatory mechanisms. In work, we present our highly complex dynamic system, the transient response could offer significantly more information on the limited to studying two-steady-state conditions. From acould control perspective, as more cellular metabolism isthe a highly complex dynamic system, the transient response offer significantly information on highly complex dynamic system, the transient response could offer significantly more information on the initial results ondynamic analyzing the dynamic transcriptomic profiles ofsignificantly Scheffersomyces stipitis during the cellular metabolism, particularly on potential gene regulatory mechanisms. In this work, we present our highly complex system, the transient response could offer more information on the cellular metabolism, particularly on potential gene regulatory mechanisms. In this work, we present our cellular metabolism, particularly on potential gene regulatory mechanisms. In thisthat work, present our transition fromon aerobic growth todynamic oxygen-limited fermentation. Ourof results confirm the we transient states initial results analyzing the transcriptomic profiles Scheffersomyces stipitis during the cellular metabolism, particularly on potential gene regulatory mechanisms. In this work, we present our initial results on analyzing the dynamic transcriptomic profiles of Scheffersomyces stipitis during the initial results on analyzing the dynamic transcriptomic profiles of Scheffersomyces stipitis during the indeed offer important and significantly more information than that provided by the steady-states alone. In transition from aerobic growth to oxygen-limited fermentation. Our results confirm that the transient states initial results on analyzing the dynamic transcriptomic profiles of Scheffersomyces stipitis during the transition from aerobic growth to oxygen-limited fermentation. Our results confirm that the transient states transition from aerobic growth to oxygen-limited fermentation. Our results confirm that the transient states addition, the transient transcriptomic data revealed interesting regulatory behavior that was not expected indeed offer important and significantly more information than that provided by the steady-states alone. In transition from aerobic growth to oxygen-limited fermentation. Our results confirm that the transient states indeed offer important and significantly more information than that by alone. indeed offer important andcrucial significantly more information thancellular that provided provided by the the steady-states steady-states alone. In In before, which could offer insight inrevealed understanding the metabolism. addition, the transient transcriptomic data interesting behavior was not expected indeed offer important and significantly more information than regulatory that provided by the that steady-states alone. In addition, the transient transcriptomic data revealed interesting regulatory behavior that was not expected addition, the transient transcriptomic data revealed interesting regulatory behavior that was not expected before, which could offer crucial insight in understanding the cellular metabolism. addition, the transcriptome transient transcriptomic data interesting regulatory behavior that was not response, expected before, which could crucial insight in understanding the cellular metabolism. before, which could offer offer crucial insight inrevealed understanding the analysis, cellular metabolism. Keywords: profiling, transcriptomic data steady-state, transient before, which could offer crucial insight in understanding the cellular metabolism. © 2019, IFACtranscriptome (International Federation of Automatic Control) Hosting bymetabolic Elsevier Ltd. All rights reserved. Scheffersomyces stipites, principal component analysis, genome-scale model Keywords: profiling, transcriptomic data analysis, steady-state, transient response, Keywords: transcriptome profiling, transcriptomic data analysis, steady-state, transient Keywords: transcriptome profiling, transcriptomic data analysis, metabolic steady-state, transient response, response, Scheffersomyces stipites, principal component analysis, genome-scale model Keywords: transcriptome profiling, transcriptomic analysis, metabolic steady-state, Scheffersomyces stipites, component analysis, genome-scale model Scheffersomyces stipites, principal principal component analysis, data genome-scale metabolic modeltransient response, Scheffersomyces stipites, principal component analysis,completely genome-scale metabolic model that only show differential misses the genes expression transition the show two steady-states, completely misses genes that differential completely during missesthethe the genes between that only only show differential 1. INTRODUCTION completely misses the genes that only show differential which could offer important insights on the regulatory expression during the transition between the two steady-states, completely misses the genes that only show differential expression during the the transition transition between between the the two two steady-states, steady-states, 1. INTRODUCTION expression during 1. INTRODUCTION mechanism of cellular metabolism. which could offer important insights on the regulatory 1. INTRODUCTION Recently, the rapid advancements in “omics” (e.g., genomics, which expression during the important transition between steady-states, which could could offer important insightstheon ontwothe the regulatory offer insights regulatory 1. INTRODUCTION mechanism of cellular metabolism. profiling transcriptomics, proteomics and metabolomics) which could offer important insights on the mechanism of cellular metabolism. Recently, the rapid advancements in “omics” (e.g., genomics, mechanism cellular metabolism. Recently, the rapid advancements in “omics” (e.g., It has been of recognized that cellular metabolism isregulatory a highly Recently, the have rapidproteomics advancements inmetabolomics) “omics” (e.g., genomics, genomics, technologies enabled large amount of “omics” data to be mechanism of cellular metabolism. transcriptomics, and profiling Recently, the rapid advancements in “omics” (e.g., genomics, transcriptomics, proteomics and metabolomics) profiling complex dynamic system. From a control perspective, the It has been recognized that cellular metabolism is highly transcriptomics, and profiling It has has been been recognized recognized that that cellular cellular metabolism metabolism is is aaa highly highly collected to have helpproteomics understand the metabolomics) cellular metabolism at It technologies enabled large amount of “omics” data to be transcriptomics, proteomics and metabolomics) profiling technologies have enabled large amount of “omics” data to be transient trajectory of a dynamic system could offer complex dynamic system. From a control perspective, the technologies have largepromising amount of‘omics’ “omics” data to be It has been recognized thatFrom cellular metabolism is a highly complex dynamic system. From control perspective, the genome-scale. Oneenabled of the most data sources collected help understand the metabolism at dynamic system. aa control perspective, the technologies enabled large amount of “omics” data to be collected to to have help understand the cellular cellular metabolism at complex significantly more information on the system dynamics, transient trajectory of a dynamic system could offer collected to help understand the cellular metabolism at complex dynamic system. From a control perspective, the transient trajectory of a dynamic system could offer enhancement is for genome-scale metabolic model (GEM) genome-scale. One the ‘omics’ data trajectory ofcontrol a dynamic system could offer collected to help understand the cellular metabolism at transient genome-scale. One of of the most most promising promising ‘omics’ data sources sources especially themore internal mechanism. As illustrated in significantly information on the system dynamics, genome-scale. One of the most promising ‘omics’ data sources transient trajectory of a dynamic system could offer significantly more information on the system dynamics, transcriptomic data. The development of high-throughput for genome-scale metabolic model (GEM) enhancement is significantly more information onthethe system dynamics, genome-scale. Onemetabolic of the mostmodel promising ‘omics’ data sources for genome-scale (GEM) enhancement is Fig. 1, for two systems that share same initial and final especially the internal control mechanism. As illustrated in for genome-scale metabolic model (GEM) enhancement is significantly more information on the system dynamics, especially the internal control mechanism. As illustrated in provide accurate sequencing technologies such as RNA-Seq transcriptomic data. The development of high-throughput the internal control mechanism. As illustrated in for genome-scale metabolic model (GEM) enhancement is especially transcriptomic data. The development of high-throughput steady-states, steady-states alone cannot differentiate the two Fig. 1, for two systems that share the same initial and final transcriptomic data. The development of high-throughput especially the internal control mechanism. As illustrated in Fig. 1, for two systems that share the same initial and final quantification of gene expression levels which can be utilized sequencing technologies such as RNA-Seq provide accurate Fig. 1, for two systems that share the same initial and final transcriptomic data. Thesuch development of provide high-throughput sequencing technologies as RNA-Seq accurate systems; issteady-states the transient response that offers steady-states, alone cannot differentiate the two sequencing technologies such as RNA-Seq provide accurate 1, forit two systems thatalone share the same initial necessary and steady-states, steady-states alone cannot differentiate thefinal two to gain a better understanding oflevels the genetic response to an Fig. steady-states, steady-states cannot differentiate the two quantification of gene expression which can be utilized sequencing technologies such as RNA-Seq provide accurate quantification of gene expression levels which can be utilized information to characterize the system dynamics accurately systems; it is the transient response that offers necessary quantification of gene expression levels which can be utilized steady-states, steady-states alone cannot differentiate the two systems; it is the transient response that offers necessary external employed by an organism. While RNA-Seq systems; it is the response that offers accurately necessary to gain aastimulus better of the genetic response to an quantification of understanding gene expression levels which can be utilized to gain better understanding of the genetic response to an and differentiate thetransient two systems. information to characterize the system dynamics to gain a better understanding of the genetic response to an systems; it is the transient response that offers necessary information to characterize the system dynamics accurately provides high gene by expression data for a model, to characterize the system dynamics accurately external employed an While RNA-Seq to gain astimulus betterquality understanding of organism. the genetic response to the an information external stimulus employed by organism. While RNA-Seq and differentiate the two systems. external stimulus employed by an an organism. While RNA-Seq information to characterize the system dynamics accurately and differentiate the two systems. relationship between gene expression levels and metabolic and differentiate the two systems. provides high quality gene data for a model, the external stimulus employed by an organism. While RNA-Seq provides high quality gene expression for the provides quality gene expression data datamodifications for aa model, model, and the and differentiate the two systems. flux is nothigh direct due gene to post-translational relationship between expression levels metabolic provides high quality gene data forand a model, the relationship between gene expression levels and metabolic relationship between gene expression levels and metabolic other regulatory activities within the cell. Nevertheless, the flux is not direct due to post-translational modifications and relationship between gene expression levels and metabolic flux is not direct due to post-translational modifications and flux isregulatory not direct due to post-translational modifications and data captured by RNA-Seq analysis provides an opportunity to other activities within the cell. Nevertheless, the flux isregulatory not direct activities due to post-translational modifications and other within the cell. Nevertheless, the other regulatory activities within thecells cell.respond Nevertheless, the how to genetic gain a better understanding on data captured by RNA-Seq analysis provides an opportunity to other regulatory activities analysis within the cell. Nevertheless, data captured by provides an to data captured by RNA-Seq RNA-Seq analysis provides an opportunity opportunitythe to and/or perturbations atcells genome-scale. gain aa environmental better understanding on how respond to genetic data captured by RNA-Seq analysis provides an opportunity to gain better understanding on how cells respond to genetic gain a better understanding on how cells respond to genetic and/or environmental perturbations at genome-scale. gain a better understanding on how cells respond to genetic and/or environmental perturbations at genome-scale. and/or environmental at agenome-scale. analysis hasperturbations been used in variety of studies for RNA-Seq and/or environmental perturbations at genome-scale. gaining information regarding the genetic responses of the RNA-Seq analysis has been used in a of studies for RNA-Seq analysis has been in aa variety variety of for RNA-Seq analysis has been used used ingenetic variety of studies studies for (Ma et al., organisms to environmental/genetic perturbations gaining information regarding the responses of the RNA-Seq analysis has been used in a variety of studies for Fig. 1. Example of a cellular state transitioning between two gaining information information regarding regarding the the genetic genetic responses responses of of the the gaining 2017, Rong-Mullins et al., 2018). While these studies have organisms to environmental/genetic perturbations (Ma et al., gaining information regarding the genetic responses of the organisms to perturbations (Ma et al., viaof two different transient responses. organisms to environmental/genetic environmental/genetic perturbations (Ma ethave al., Fig. between two 1. Example aa cellular state transitioning provided insightful findings, they are most often limited to steady-states 2017, Rong-Mullins et al., 2018). While these studies Fig. 1. of state transitioning between organisms to environmental/genetic perturbations (Ma ethave al., Fig. 1. Example Example of a cellular cellular state transitioning between two two 2017, Rong-Mullins et al., 2018). While these studies steady-states via two different transient responses. 2017, Rong-Mullins et al., 2018). While these studies have Fig. 1. Example of a cellular state transitioning between two studying two-steady-state conditions. By comparing the whole steady-states via two different transient responses. provided insightful findings, they are most often limited to steady-states via two different transient responses. 2017, Rong-Mullins et al., 2018). While these studies have provided insightful findings, they are most often limited to this work,via guided by principles in control engineering, we provided insightful findings, theytwo areBy most often limited to In steady-states two different transient responses. one can gene expression profiles at the steady-states, studying two-steady-state conditions. comparing the whole provided insightful findings, they areBy most often limited to In studying two-steady-state conditions. comparing the whole focus on measuring and analysing the dynamic transient studying two-steady-state conditions. By comparing this work, guided by principles in control engineering, we identify the key profiles genes that show large changesthe inwhole gene In this work, guided by principles in control engineering, we gene expression at the two steady-states, one can studying two-steady-state conditions. By comparing the whole In this on work, guidedmetabolism byand principles in control engineering, we gene expression profiles at the two steady-states, one can response of cellular to an environmental stimulus gene expression profiles at the two steady-states, one can focus measuring analysing the dynamic transient In this work, guided by principles in control engineering, we expression level and postulate how these changes are related focus on measuring and analysing the dynamic transient identify the key genes that show large changes in gene gene expression profiles at the two steady-states, one can focus on measuring and analysing the dynamic transient identify the key genes that show large changes in gene to gain better understanding of the potential regulatory identify the key genes that show large changes in gene response of cellular metabolism to an environmental stimulus focus on measuring and analysing the dynamic transient possible gene are regulatory to each other; and therefore, of cellular metabolism to an environmental stimulus expression and postulate how these related identify thelevel key genes that infer show largechanges changes gene response response of cellular metabolism anthe environmental stimulus expression level and postulate how these changes are mechanisms. Specifically, usingto Scheffersomyces stipitis (an expression level and postulate howsteady-state these changes areinrelated related to gain better potential regulatory understanding of response of cellular metabolism to anthe environmental stimulus mechanisms. However, such comparison to gain better understanding of potential regulatory to each other; and therefore, infer possible gene regulatory expression level and postulate how these changes are related to gain better understanding of the potential regulatory to each other; and therefore, infer possible gene regulatory to each other; and therefore, infer possible gene regulatory mechanisms. Specifically, using Scheffersomyces stipitis (an gain better understanding the potential stipitis regulatory mechanisms. Specifically, using (an mechanisms. However, such to each other; and therefore, infer steady-state possible genecomparison regulatory to mechanisms. Specifically, usingofScheffersomyces Scheffersomyces stipitis (an mechanisms. However, such steady-state comparison mechanisms. However, such steady-state comparison mechanisms. Specifically, using Scheffersomyces stipitis (an Copyright © 2019, 2019However, IFAC 538Hosting 2405-8963 © IFAC (International of Automatic Control) by Elsevier Ltd. All rights reserved. mechanisms. such Federation steady-state comparison Peer review©under of International Federation of Automatic 2019 responsibility IFAC 538Control. Copyright Copyright © 538 10.1016/j.ifacol.2019.06.118 Copyright © 2019 2019 IFAC IFAC 538 Copyright © 2019 IFAC 538

2019 IFAC DYCOPS Florianópolis - SC, Brazil, April 23-26, 2019Matthew Hilliard et al. / IFAC PapersOnLine 52-1 (2019) 538–543

important yeast in biorenewables) as the model organism, we designed experiments to obtain its transcriptomic profiles during the transition from aerobic growth to oxygen-limited fermentation, and analysed the dynamic transcriptomic data to gain better understanding on its cellular metabolism. As existing methods are mainly designed to perform comparison analysis between two steady-states, how to effectively analyse dynamic transcriptomic data presents significant challenges. In this work, to address the complexity and scale of dynamic transcriptomic data, we integrate multivariate statistical tools developed for monitoring large-scale complex dynamic industrial processes with bioinformatics tools to analyse the dynamic transcriptomic data, and present the preliminary results we have obtained so far.

539

2.2 Sample Collection and RNA-Seq Analysis A total of 22 cell samples were collected for each experiment. Each cell sample was centrifuged for 5 minutes at 10000 rpm at room temperature. After centrifugation, the samples were decanted and the remaining cell pellets were immediately submerged into liquid nitrogen. The frozen cell samples were stored in a -80oC freezer until all cell samples from the four trials (88 samples in total) were collected. The frozen cell samples were shipped to University of Wisconsin-Madison Biotechnology Center, which conducted the RNA extraction, and analysed the samples through Illumina Next-Generation Sequencing (DNA Sequencing, 2018). The data set consists of raw count data, fragments per kilobase per million reads (FPKM), and transcripts per million (TPM). The raw reads were aligned to the CBS6054 genome using STAR aligner, and the aligned reads were quantified using RSEM (Li & Dewey, 2011; Dobin et al., 2013). After alignment, each sample contains 5966 gene expression levels. Thus, for each trial, we have a 5966x22 matrix of raw count data, FPKM value, and TPM values.

2. WET LAB EXPERIMENTS This section provides an overview of the experimental design, data acquisition and pre-processing.

2.1 Experimental Design 2.3 Data Pre-processing Pipeline

All experiments were conducted in a Bio Flo 115 with a working volume of 2 L using S. stipitis CBS5773. After inoculation, the cells were cultured under batch mode until biomass reached or exceeded 6 g/L. Then the culture was switched to continuous operation; after a controlled chemostat of aerobic growth was achieved and maintained, two samples were collected at a ten minute interval. The cells were determined to be at aerobic growth when the only products detected were biomass and carbon dioxide. Next, the oxygen supply was immediately and significantly reduced in order to induce oxygen-limited fermentation. In addition, the dilution rate was simultaneously adjusted in order to maintain a relatively constant cell density in the reactor. After the oxygen limitation was introduced, 18 samples were collected at ten minute intervals for three hours during the transition period. Finally, after the cells were maintained at the oxygen-limited steady-state for 24 hours, two more samples were collected at ten minute intervals for the new, oxygen-limited steady-state. Fig. 2 shows the biomass level in the bioreactor during one experimental run, together with the feeding gas oxygen concentration. The experiment was repeated four times.

The primary goal of the wet lab experiments is to identify the differentially expressed genes (DiffEpr) during the transition from aerobic growth to oxygen-limited fermentation. The ultimate goal is to gain better understanding on how the metabolic network responds to the perturbation and to postulate potential gene regulation mechanisms that govern the transition process. In this work, we primarily focus on analysing the TPM data. Fig. 3 provides the workflow for the RNA-Seq data analysis pipeline, which consists of six steps as described below, and Table 1 shows the number of genes after different steps of pre-processing: Organize Count Data (Count, CPM, TPM, FPKM)

Step 6 Data Correction

Step 1 Data Cleaning

Step 5 Remove Activated Genes

Step 2 Data Smoothing

Step 4 DiffEpr Filter

Step 3 Evaluate DiffEpr

Fig. 3. Data Pre-processing pipeline

Table 1. Results from Pre-processing Pipeline After Step 1 Step 4 Step 5 Step 6

Trial I 5608 4429 4375 4338

Trial II 5649 4159 4125 4115

Trial III 5675 4142 4099 4093

Trial IV 5686 4864 4845 4842

1. Data cleaning: It was suggested that genes with a count per million (CPM) value less than 1 should not be considered as expressed (Chen et al., 2017). In this work, we chose TPM as the focus of analysis. Since TPM normalizes the raw count data against the transcript length and depth of sequencing, it

Fig. 2. Biomass concentration over time during the transition from aerobic steady-state to oxygen-limited steady-state. Plus signs (+) represent the cell concentration and the orange line represents the oxygen conditions. 539

2019 IFAC DYCOPS 540 Florianópolis - SC, Brazil, April 23-26, 2019Matthew Hilliard et al. / IFAC PapersOnLine 52-1 (2019) 538–543

has been suggested that TPM should be used to compare RNASeq data between multiple trials (Wagner et al., 2012). For data cleaning, we removed the genes with a TPM less than 2 for all 22 samples.

can extract a few directions of the dominant variances from a high-dimensional data set. In this case, the first two principal components (PCs) captured about 80% of variation amongst all 88 samples, and the scatter plots of these first 2 scores are plotted in Fig. 4. In Fig. 4, the circles and stars indicate the aerobic and oxygen-limited steady-states, respectively, while the triangles indicate the transition states. Fig. 4 clearly shows that despite the fact that the four trials follow a somewhat different trajectory from each other, their aerobic and oxygenlimited steady-states are clustered closely together, indicating similar initial and final states of the dynamic transition process. This result suggests that the four trials can be treated as biological replicates, while the different transient trajectory is likely due to the stochastic nature of gene-regulatory events. However, since each trial started at relatively the same condition, with the same perturbation induced, and ended in relatively the same condition as well, we do expect that the dominant genetic responses are shared among the four trials. To validate this, we performed the Pearson correlation analysis to examine how each trial is correlated with each other in terms of differentially expressed genes. The results show that nearly 1500 genes (out of 3674) show Pearson correlation coefficient large than 0.9, indicating highly similar behaviour from the four trials.

2. Evaluating gene differential expression (DiffEpr): To identify and evaluate the differentially expressed genes, we calculated the log2 fold change (Log2FC) for each gene during 𝑇𝑇𝑇𝑇𝑇𝑇i (x) ), the transition, which is defined as Log 2 FC = log 2 ( 𝑇𝑇𝑇𝑇𝑇𝑇𝑖𝑖 (0)

where TPMi(0) is the TPM for the reference state of gene i, which is the average of the two aerobic steady-state samples. For each trial we obtain an ngene×20 Log2FC matrix, where each column corresponds to a dynamic sample point (18 transition states and 2 oxygen-limited steady-states). 3. DiffEpr filter: Literature suggests that for a gene to be considered truly differentially expressed, it must have a Log2FC value greater than 1.25. Thus, in this step, we remove genes that have |Log2FC|<1.25 for all 20 points. 4. Identifying and separating inactive-to-active genes: for genes that are inactive under the reference state (i.e., aerobic growth) but activated due to the stimulus, their corresponding TPM reference values are very small and would inflate Log2FC values and may cause misleading results. In this work, these genes are labelled as inactive-to-active genes and are analysed separately. 5. Data correction: For genes with TPM values smaller than 1 during the transition states (indicating deactivation), we set the TPM to “1”. Since the number of such genes is relatively small and it is often associated with genes with low expression at the reference state, we took this approach for simplicity.

3. DYNAMIC TRANSCRIPTIMIC DATA ANALYSIS x104

In this section, we present our initial data analysis results. First we provide an overview of the dynamic transcriptomic data from all four trials, then we focus on the analysis of Trial I.

Fig. 4. Score plot for TPM values from all four trials. The circles and stars indicate the aerobic and oxygen-limited steady-states, respectively, while the triangles indicate the transition states.

3.1 Overview of All Four Trials

3.2 Transient states vs. steady-states

In the experimental design, the four trials were planned to be 4 biological replicates; and ideally, they should start with the same inoculum. However, due to limited space and man power, it was not possible to collect time series samples every 10 minutes from 4 bioreactors simultaneously. As a result, the four trials were conducted sequentially with different batches of inoculum. Even though the cells were grown to aerobic steady-state following the same protocols in all four trials, some biological variance still exists amongst the trials. Therefore, before all 4 trials can be evaluated to determine the global response, they must first be evaluated to determine whether or not they can be considered biological replicates. To examine the similarity of all 4 trials, we applied principal component analysis (PCA) to analyse the TPM value for all 88 samples (22 samples per trial).

One important observation of Fig. 4 is that compared to the final steady-state (i.e., oxygen-limited, shown as stars), the transients states (shown as triangles) show significantly larger range of variations. This also suggests that if only the two steady-states were considered, key information on regulatory mechanism that govern the transition between the two steadystates could be missed. To validate this, we examined the Log2FC of genes that are deemed as differentially expressed from each trial, i.e., the genes that showed differential expression in at least one sample out of the 22 samples from a trial. We divided all differentially expressed genes into two groups: the first group (I) includes genes that show differential expression in the final, oxygen-limited steady-state; the second group (II) includes genes that show differential expression only during the transition period. In other words, genes in the second group return to their aerobic steady-state level in the

PCA is a dimension reduction approach commonly used for monitoring large-scale, dynamic industrial processes, which 540

2019 IFAC DYCOPS Florianópolis - SC, Brazil, April 23-26, 2019Matthew Hilliard et al. / IFAC PapersOnLine 52-1 (2019) 538–543

new oxygen-limited steady-state. Table 2 is the result of grouping for all four trials, which shows that 47% ~ 73% of the differentially expressed genes show changes during the transition period only. Some examples of the group I and group II genes are shown in Fig. 5.

then selecting the genes with loading values of the first PC larger than 0.01. Among 3297 genes that show differential expression (from both groups I and II) among all four trials, 2684 (about 80%) of them were selected as their loading values are larger than 0.01 or smaller than -0.01. Among them, 1512 are annotated genes. Fig. 6 illustrates the effect of this selection, where Fig. 6(a) shows all 3297 genes in the principal subspace spanned by the first 3 PCs, and Fig. 6(b) shows the 1512 selected genes. From these figures it is clear that such selection enables a significantly improved clustering by considering genes that dominate the changes of the transition period.

Table 2 Percentage of Genes Differentially Expressed in Oxygen-limited Steady-state and Transient State Only Trial I II III IV

Group I 38% 31% 27% 53%

(a)

541

Group II 62% 69% 73% 47%

(b) (a)

(c)

(d)

Fig. 5. TPM vs Time data for genes in Group I (a&b) and Group II (c&d). Blue Line: Trial I, Orange Line: Trial II, Yellow Line: Trial III, Purple Line: Trial IV.

Fig. 4, 5 and Table 2 together highlight the importance of studying the dynamic transient response of cellular metabolism, in order to understand the global response triggered by the perturbation and potential gene-regulatory mechanism. While the new steady-state represents the phenotype expressed by the cells, it is likely a direct result of the path taken during the transition period, and thus, the genes differentially expressed during the transition period, can have significant influence on the final steady-state. Therefore, it is of significant interest to identify the key genes affected during the transition period and discover their role in regulating the final steady-state phenotype.

(b) Fig. 6. (a) Score plot for first three PC directions for all 3297 genes that are DiffEpr. (b) Score plot for scores from first three PC directions for 1512 annotated genes selected by PC loading value.

3.3 Clustering analysis of the dynamic transcriptomic data Here we present our preliminary results on analysing the dynamic transcriptomic data from Trial I. To obtain a big picture on how different genes responded during the transition, we focused on the annotated genes that dominate the changes during the transition period. This is done by first performing PCA on the TPM data of all differentially expressed genes, and

Fig. 7. Dendrogram for the hierarchical clustering of the identified 1512 genes. The clustering results in two large clusters which are clearly distinct: Cluster I (red) and Cluster II (blue).

541

2019 IFAC DYCOPS 542 Florianópolis - SC, Brazil, April 23-26, 2019Matthew Hilliard et al. / IFAC PapersOnLine 52-1 (2019) 538–543

Next, clustering was performed on the selected annotated genes. To reduce the effect of the measurement noises contained in dynamic transcriptomic samples, we apply clustering on the PCA scores of the TPM data, instead of on the TPM data directly. Since the first 5 PCs captured over 95% of variance over the transition periods, the first 5 scores, instead of the 22 samples, were used for clustering. Euclidean distance metric with Ward method was used for clustering and two major clusters were identified as shown in Fig. 7. Cluster I contains 1265 genes which were mainly downregulated as shown in Fig. 8(a); while Cluster II contains 247 genes which were mainly upregulated as shown in Fig. 8(b). As indicated by Fig. 7, Cluster I can be further divided into 6 subclusters, and Fig. 9 plots the representative behaviour of each subcluster. The most represented pathways in Cluster I include cell cycle, meiosis, MAPK signalling and purine metabolism. Similarly, Cluster II can be further divided into 4 subclusters, and Fig. 10 plots the representative behaviour of each subcluster. The most represented pathways in Cluster II include ribosome biogenesis, glycolysis/gluconeogenesis, oxidative phosphorylation, etc.

growth-related, as the cell growth rate reduced significantly when reduced oxygen-supply was introduced, and stayed low throughout the transition and the final steady-state. This is clearly indicated by the dilution rate of Trial I (Fig. 11, the other trials are all similar).

(a)

(b)

(c)

(d)

(e)

(f)

(a)

Fig. 9. (a-f) TPM vs Time Plot for Representative genes for the six subclusters of Cluster I.

(b) Fig. 8. (a) TPM vs Time Plot for Cluster I Genes. (b) TPM vs Time Plot for Cluster II Genes.

As shown in Fig. 9, all 6 subclusters in cluster I show significant down regulation after the environmental perturbation (i.e., reduced oxygen supply) was introduced, and 5 of the 6 subclusters showed some level of recovery at the final steady-state, with only one subcluster (Fig. 9d) maintaining the down-regulation at the final steady-state. Similarly for cluster II, in Fig. 10 only one subcluster (Fig. 10a) maintained the up-regulation in the new steady-state while the other 3 subclusters recovered to the proximity of their initial steady-state. Although it has been reported that most gene-regulation responses are growth related, it should be noted that such recovery at the final steady-state was not

(a)

(b)

(c)

(d)

Fig. 10. (a-d) TPM vs Time Plot for Representative genes for the four subclusters of Cluster II.

542

2019 IFAC DYCOPS Florianópolis - SC, Brazil, April 23-26, 2019Matthew Hilliard et al. / IFAC PapersOnLine 52-1 (2019) 538–543

543

state which could enable cell’s fast respond to different disturbances. 6. ACKNOWLEDGEMENT This work was supported by National Science Foundation (1069004 for MH, 1264861 for JW and QHE) and Department of Education (P200A150074 for MH and JW). REFERENCES Chen, Y., McCarthy, D., Ritchie, M., Robinson, M., Smyth, G.K. [2017]. edgeR: differential expression analysis of digital gene expression data User’s Guide. Damiani, A. L., He, Q. P., Jeffries, T. W., & Wang, J. [2015]. Comprehensive evaluation of two genome‐scale metabolic network models for Scheffersomyces stipitis. Biotechnology and bioengineering, 112(6), 1250-1262.

Fig. 11. Dilution Rate vs Time Plot for Trial I. A similar trend was observed in all four trials.

5. CONCLUSION AND DISCUSSION In this paper we present our initial results on the analysis of a dynamic transcriptomic data set of S. stipitis during the transition from aerobic growth to oxygen-limited fermentation. This data set, to our best knowledge, is the first dynamic data set of S. stipitis, an important yeast strain in biorenewables, to capture the changes in cellular transcriptome triggered by the reduced oxygen supply.

Dobin, A., Davis, C.A., Schlesinger, F., Drenkow, J., Zaleski, C., Jha, S., Batut, P., Chaisson, M., Gingeras, T.R. [2013]. STAR: ultrafast universal RNA-seq aligner. Bioinformatics, 29(1), 15-21. DNA Sequencing Facility: Illumina Sequencing 2018, University of Wisconsin-Madison Biotechnology Center, viewed February 2018, https://www.biotech.wisc.edu/services/dnaseq/servic es/Illumina

The data set contains RNA-seq data collected from four independent trials. Although the four trials were conducted with different inoculum, the cells were controlled to and maintained at the same initial steady-state (aerobic growth) before the same perturbation (reduce oxygen supply) was introduced. Our initial analysis confirms that these four trials can be treated as biological replicates, as their initial and final steady-states are very close to each other; in addition, the transient trajectories show high similarity.

Li, B., Dewey, C.N. [2011]. RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics, 12:323. Ma, J., Zhang, B.-B., Wang, F., Sun, M.-M., Shen, W.-J., Lv, C., Gao, Z., Chen, G.-X. [2017]. RNA-Seq Analysis of Differentially Expressed Genes in Rice under Photooxidation. Russ J Plant Physl+, 64(5), 698706.

The initial analysis also confirms that measuring the transient states is very important for understanding how a cellular metabolism responds to a given stimulus at genome-scale. More than half of the differentially expressed genes only showed changes during transient states, and returned to the proximity of the initial state after reaching the new steadystate. Even for the genes that show differential expression at the new steady-state, many of them experienced much larger or reverse changes during the transient states compared to the new steady-state. If only the transcriptome profiles at the two steady-states were compared, important information would have been missed.

Rong-Mullins, X., Ayers, M.C., Summers, M., Gallagher, J.E.G. [2018]. Transcriptional Profiling of Saccharomyces cerevisiae Reveals the Impact of Variation of a Single Transcription Factor on Differential Gene Expression in 4NQO, Fermentable, and Nonfermentable Carbon Sources. G3-Genes Genom Genet, 8. 607-619. Wagner, G.P., Kin, K., Lynch, V.J. [2012] Measurement of mRNA abundance using RNA-seq data: RPKM measure is inconsistent among samples. Theory Biosci. 131, 281-285.

Multivariate statistical tools such as PCA developed in process systems engineering for large-scale, complex dynamic industrial processes were integrated with bioinformatics tools such as clustering to analyze the dynamic transcriptomic data. Such integration enables a much improved clustering performance, and potentially better understanding of the gene regulatory mechanism. Our analysis showed that for the genes that dominate the transition period, most of them returned to the proximity of the initial steady-state (i.e., aerobic growth), and such recovery is not growth associated. The closely related initial and final steady-states suggests that the steady-state gene expression profile might be a highly conserved, optimal 543