OTP: An automatized system for managing and processing NGS data

Accepted Manuscript

Authors: Eva Reisinger, Lena Genthner, Jules Kerssemakers, Philip Kensche, Stefan Borufka, Alke Jugold, Andreas Kling, Manuel Prinz, Ingrid Scholz, Gideon Zipprich, Roland Eils, Christian Lawerenz, Jürgen Eils

PII: S0168-1656(17)31592-4
DOI: http://dx.doi.org/10.1016/j.jbiotec.2017.08.006
Reference: BIOTEC 7986

To appear in: Journal of Biotechnology

Received date: 22-2-2017
Revised date: 3-8-2017
Accepted date: 7-8-2017

Please cite this article as: Reisinger, Eva, Genthner, Lena, Kerssemakers, Jules, Kensche, Philip, Borufka, Stefan, Jugold, Alke, Kling, Andreas, Prinz, Manuel, Scholz, Ingrid, Zipprich, Gideon, Eils, Roland, Lawerenz, Christian, Eils, Jürgen, OTP: An automatized system for managing and processing NGS data. Journal of Biotechnology http://dx.doi.org/10.1016/j.jbiotec.2017.08.006 This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

OTP: An automatized system for managing and processing NGS data

Authors: Eva Reisinger 1,2, Lena Genthner 1, Jules Kerssemakers 1, Philip Kensche 1, Stefan Borufka 1, Alke Jugold 1, Andreas Kling 1, Manuel Prinz 1, Ingrid Scholz 1, Gideon Zipprich 1, Roland Eils 1,2,3,4, Christian Lawerenz 1, Jürgen Eils 1

1 Department of Theoretical Bioinformatics, German Cancer Research Center (DKFZ)
2 Heidelberg Center for Personalized Oncology, DKFZ-HIPO, DKFZ, Heidelberg, Germany
3 Institute of Pharmacy and Molecular Biotechnology and Bioquant Center, University of Heidelberg, Heidelberg, Germany
4 Translational Lung Research Center Heidelberg (TLRC), German Center for Lung Research (DZL), University of Heidelberg, Heidelberg, Germany

Highlights of OTP
- OTP, the in-house framework of the German Cancer Research Center (DKFZ), has processed 20,000 samples to date.
- OTP contains the data of all German submissions to the ICGC and of the German Consortium for Translational Cancer Research.
- OTP normally requires no manual intervention, handling data import, processing and user notification. Even many processing errors are resolved automatically, enabling 24/7 operation.
- OTP provides structured data storage and the setting of access permissions.
- OTP can interact with any job execution framework that contains bioinformatic pipelines, which enables OTP to integrate new pipelines in a short time. Currently it interacts with Roddy.
- OTP integrates a database which contains all information about the sequencing data, from the sample preparation to the processing information of analyses.
- OTP has an intuitive and structured graphical user interface which provides all kinds of statistical information, quality reports, processing results and statuses, and overviews of projects and the included data.

- OTP is used in a clinical context at the National Center for Tumor Diseases (NCT) and in research at the DKFZ and by collaboration partners.
- OTP does not depend on a specific cluster system.
- OTP implements a user management system to ensure data security.

Abstract
The One Touch Pipeline (OTP) is an automation platform managing Next-Generation Sequencing (NGS) data and calling bioinformatic pipelines for processing these data. OTP handles the complete digital process, from the import of raw sequence data via the alignment of sequencing reads to the identification of genomic events, in an automated and scalable way. Three major goals are pursued: firstly, reduction of the human resources required for data management by introducing automated processes; secondly, reduction of the time until the sequences can be analyzed by bioinformatic experts, by executing all operations more reliably and quickly; thirdly, storing all information in one system with secure web access and search capabilities. From a software architecture perspective, OTP is both an information center and a workflow management system. As a workflow management system, OTP calls several NGS pipelines that can easily be adapted and extended according to new requirements. As an information center, it comprises a database for metadata as well as a structured file system. Based on complete and consistent information, data management and bioinformatic pipelines within OTP are executed automatically, with all steps recorded in a database.
Keywords: Next-Generation Sequencing; data management; data processing; automation; user interface; standardization

1. Introduction
Human genetics and genomics research has massively expanded in the past decade, and constant cost reduction has made it feasible to introduce Next-Generation Sequencing (NGS) technology into daily hospital routine and large cohort analyses. Both the size and the diversity of the datasets produced in healthcare and medical research have increased, which challenges the capabilities of life science data centers to correctly and efficiently process and store data. To cope with the flood of data, large storage pools and compute facilities, comprehensive processing platforms and automatic data handling are urgently needed to reduce human work. Through automation, large quantities of genomic data can be processed with small staff expenditure and fewer faults than by manual interaction. In a research context, automatically managed workflow initialization and maintenance allows the bioinformatics experts to focus on developing new analysis methods and on interpretation rather than on processing the data. In a clinical context, automation ensures the fast provision of results relevant for treatment. This publication presents how NGS data processing is managed at a large data provider such as the German Cancer Research Center (DKFZ) and the affiliated National Center for Tumor Diseases (NCT) [1]. The DKFZ is one of the largest biomedical research centers and genome sequencing facilities in Europe. By now, most of the enormous data flow produced within the DKFZ is processed automatically via the One Touch Pipeline (OTP). OTP is a platform for structured data storage, data access management and the processing of NGS data. It is designed to meet the data management needs of a diverse scientific environment and clinical research. Data processing capacity is provided to DKFZ's in-house projects like the Heidelberg Center for Personalized Oncology (HIPO) [2] as well as to partners in the German Network for Translational Cancer Research (DKTK) [3] and the German contributions to the International Cancer Genome Consortium (ICGC) [4]. In most of these projects, several hundreds of genomes have been processed and managed via OTP, such that a total of about 20,000 samples covering more than 11,000 donors have been accrued since the beginnings of OTP in 2012 (Fig 1). In 2013, the NCT established sequencing for many of their cancer patients. This shift towards clinical application led to new requirements in order to provide a solution as a basis for treatment decisions in clinical cancer care. In particular, clinical data processing demands faster and more automated processing, improved data protection and security, and streamlined quality assurance. The automated management of recurrent, standardized analyses of diverse NGS data types is the key feature of OTP. Starting with automated meta- and raw data import, OTP handles the complete digital process, from quality control and sequence alignment to the identification of single-nucleotide and structural genomic events, without manual interaction. Throughout this complete process OTP provides data provenance, quality monitoring and alerting, automated notification of the processing status and automated error handling. To fulfill the requirements of both clinical and basic cancer research, a so-called fast-track procedure was implemented, which enables OTP to handle data relevant for treatment decisions with a higher priority than research data. Furthermore, since data security is a major concern in the human genomics field, OTP allows sustainable user management. Raw, processed and result files, as well as access to the corresponding web pages, are provided exclusively to authorized users. Several other platforms also seek to provide such integration and automation; prominent examples will be discussed and compared to OTP in chapter 9. It will be shown that OTP is strong in project data organization, performance and automation.

2. Dataflow in OTP
The general dataflow in OTP consists of several steps. First, the data is imported into OTP, whereby the raw data is stored on the file system and the corresponding metadata in the database. Next, the alignment and the calculation of quality control values for the FASTQ files are triggered. After the alignment is completed, further analyses like variant calling are performed. To execute most of these analyses, OTP integrates Roddy, an external job execution framework [5], which contains pipelines developed by on-site bioinformaticians. At the end of each step the user is automatically notified about the current status. OTP is able to process whole genome sequencing (WGS), whole exome sequencing (WES), whole genome bisulfite sequencing (WGBS), ChIP-seq and RNA sequencing data. A scheme of the dataflow is shown in Fig 2.

2.1. Data import
The sequencing data and the corresponding metadata need to be registered in the OTP database and stored on the file system in a defined and consistent way. To minimize manual work during import and to allow imports at night or during weekends, the process is fully automated. This was achieved by integrating the service management software OTRS [6], which was adapted to call a URL on the OTP server as soon as a notification about new sequencing data arrives. For manual imports from sequencing data providers without a defined interface to OTP, a GUI is implemented that allows data import with a few clicks. In both manual and automatic imports, OTP validates whether the metadata file contains all information required for processing and storage on the file system. In case of successful validation, the actual import process is initiated, meaning that the metadata is stored in the OTP database and the sequencing files are installed on the file system. If validation fails due to ambiguities or missing information, manual interaction is required to clarify the metadata with the provider.
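The validation itself can be pictured as a set of per-row checks against the required metadata columns. The following Groovy sketch is purely illustrative: the column names follow Tab. 2, but the class, method and rule set are invented and do not reflect OTP's internal API.

```groovy
// Illustrative sketch of metadata validation at import time (invented API;
// only the column names follow the metadata format shown in Tab. 2).
class MetadataValidator {

    static final List<String> REQUIRED_COLUMNS = [
            'FASTQ_FILE', 'MD5', 'CENTER_NAME', 'RUN_ID', 'RUN_DATE',
            'LANE_NO', 'PROJECT', 'PATIENT_ID', 'SAMPLE_ID',
            'SEQUENCING_TYPE', 'LIBRARY_LAYOUT']

    /** Returns human-readable problems; an empty list means the row may be imported. */
    List<String> validate(Map<String, String> row) {
        List<String> problems = []
        REQUIRED_COLUMNS.each { String column ->
            if (!row[column]?.trim()) {
                problems << "Missing value for required column '${column}'"
            }
        }
        if (row['MD5'] && !(row['MD5'] ==~ /[0-9a-fA-F]{32}/)) {
            problems << "'${row['MD5']}' is not a valid md5sum"
        }
        if (row['LIBRARY_LAYOUT'] == 'PAIRED' && !(row['MATE'] in ['1', '2'])) {
            problems << 'Paired-end data requires a mate number of 1 or 2'
        }
        return problems
    }
}
```

In this picture, the import proceeds only if the problem list is empty for every row; otherwise the problems are reported back for clarification, as described above.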

2.2. Processing of sequencing files
After the import of the sequencing data, quality control values are calculated for all FASTQ files using FastQC [7]. Simultaneously, several processing steps, covering alignment and corresponding quality control, single nucleotide variation (SNV) calling [8,9], insertion and deletion (InDel) calling [10], structural variation (SV) calling [11], and copy number variation (CNV) calling [12], are applied. In many of these pipelines additional QC values are produced using in-house developed calculations. An overview of the different pipelines and integrated tools is given in Tab. 1.

| Processing step | WGS | Exome | ChIP-seq | RNA | WGBS (tagmentation) |
|---|---|---|---|---|---|
| Alignment with duplication marking | bwa mem [13], bwa aln [14]; Picard [16], biobambam [17], sambamba [18] | bwa mem, bwa aln; Picard, biobambam, sambamba | bwa mem, bwa aln; Picard, biobambam, sambamba | STAR [15]; sambamba | bwa mem* |
| Methylation calling | | | | | methylCtools [19] |
| Alignment QC | flagstats [20], in-house tools | flagstats, in-house tools | flagstats, in-house tools | RNA-SeQC [21], Qualimap2 [22] | flagstats, in-house tools |
| SNV | samtools (mpileup, bcftools) [20] | samtools (mpileup, bcftools) | | | platypus [10] |
| InDel | platypus | platypus | | | platypus |
| SV | SOPHIA [11] | | | | |
| CNV | ACEseq [12] | | | | |

Tab. 1: Overview of the tools used in the different processing steps (rows), separated per sequencing type (columns). *adapted for whole genome bisulfite data

Whether and how pipelines are executed depends on the sequencing type and on configurations defined per project. For execution, three conditions must be fulfilled: (1) the configuration for the specific pipeline and the project must be specified, (2) input parameters, like the adapter sequence or the library preparation kit, must be stored in the OTP database, and (3) processing thresholds, like a minimal genome coverage, must be reached. The results of the pipelines are stored on the file system, and QC values and further result information are stored in the database and provided via the GUI. For reproducibility purposes, the exact commands and parameters of the processing are stored in the database. After each processing step OTP ensures that only authorized users can access the result data on the file system by setting the access permissions based on the Unix group defined for the project. Data submitters and registered project members are automatically notified about the successful termination of the process via email. The notification contains information such as the location of the processed data on the file system, the link to the result data in the GUI, and the subsequent analyses that will be performed.
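For illustration, the three execution conditions could be collapsed into a single guard method, as in this hedged Groovy sketch; every class and property name here is invented and does not correspond to OTP's real schema.

```groovy
// Hedged sketch of the three execution conditions named above; all classes
// and properties are invented stand-ins for OTP's actual database schema.
class Project { String name }
class Pipeline { String name }
class Sample {
    String adapterSequence
    String libraryPreparationKit
    double meanCoverage
}
class PipelineConfig {
    double minimalCoverage
    // stand-in for a database lookup, e.g. a GORM dynamic finder in Grails
    static PipelineConfig findByProjectAndPipeline(Project p, Pipeline pl) { null }
}

boolean mayStartPipeline(Project project, Pipeline pipeline, Sample sample) {
    // (1) a configuration for this pipeline and project must exist
    PipelineConfig config = PipelineConfig.findByProjectAndPipeline(project, pipeline)
    if (config == null) {
        return false
    }
    // (2) required input parameters must already be stored in the database
    if (!sample.adapterSequence || !sample.libraryPreparationKit) {
        return false
    }
    // (3) processing thresholds, e.g. a minimal genome coverage, must be reached
    return sample.meanCoverage >= config.minimalCoverage
}
```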

2.3. OTP components
For the processing of data in a computing environment, OTP uses a number of separate components (Fig 3). All compute-intensive processing tasks on genomic data are submitted to a high-throughput cluster running a batch processing system. OTP offers direct submission or indirect submission through the execution of the external job execution system Roddy. For simple tasks, including file management tasks and the execution of simple bioinformatics analysis tools like FastQC [7] or bwa aln [14], OTP submits the jobs directly to the cluster. More complex workflows, like the variant calling analyses, are submitted indirectly via Roddy. OTP collects all required input and configuration information from the database and executes Roddy, providing this information. In turn, Roddy triggers the required analysis, submits all compute-intensive jobs to the cluster and returns the corresponding job identifiers to OTP. To determine whether an analysis is finished, and to allow flexible job-dependent handling of processing errors, OTP regularly checks the status of the jobs directly on the cluster. For the interaction with the cluster, the open-source library BatchEuphoria [23], which serves as a cluster abstraction layer, is integrated. With this library, Roddy and OTP can work independently of a specific cluster system. The interface between Roddy and OTP is generic enough to allow the integration of arbitrary job execution systems like Snakemake [24] or Galaxy [25,26]. The architectural separation into a job execution framework (i.e. Roddy) and an automation platform (OTP) has the advantage of allowing the quick adoption of state-of-the-art workflows from research.
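The resulting polling loop can be sketched roughly as follows. BatchEuphoria is the real library referenced above [23], but the interface in this Groovy sketch is a simplified, invented stand-in rather than its actual API.

```groovy
// Simplified, invented cluster-abstraction layer in the spirit of BatchEuphoria;
// the real library's API differs.
interface ClusterJobManager {
    String submit(String command, Map<String, String> resources) // returns a job identifier
    JobState queryState(String jobId)
}

enum JobState { QUEUED, RUNNING, FINISHED, FAILED }

/** Poll the cluster until all jobs belonging to one processing object are done. */
boolean waitForJobs(ClusterJobManager manager, List<String> jobIds, long pollMillis = 60000L) {
    while (true) {
        List<JobState> states = jobIds.collect { manager.queryState(it) }
        if (states.any { it == JobState.FAILED }) return false   // hand over to error handling
        if (states.every { it == JobState.FINISHED }) return true
        Thread.sleep(pollMillis)                                 // jobs still queued or running
    }
}
```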

3. Database
The relational database of OTP contains all information on how to organize, structure, process and visualize the data. This information is stored in four sections of the database schema. Basic information about sequencing data received during import is stored in the first section and contains meta-information like project, patient, sequencing type, sample type and sequencing machine (Tab. 2). These properties are used to organize the FASTQ, BAM and result files on the file system and to define how the FASTQ files can be further processed. The second section of the database handles authorization information for each project, to make sure that the data on the file system and the corresponding information in the GUI are accessible only to authorized users. While processing the data, OTP retrieves information on groups and permissions from the database and enforces them on the file system. The third section of the database manages the processing. Processing statuses and requests are modeled as “processing objects” (PrObs), which are database objects and represent the information about how data must be analyzed. During processing, PrObs are used to log the exact processing history and parameters (see chapter 4 “Automation”). This section additionally includes objects that represent raw and result data as well as objects containing quality and configuration values. Bioinformatic parameters such as the reference genome, the minimum coverage required for calling variations, or the processing tools and versions to use, are stored in this section as well. The fourth section of the database contains information about the cluster jobs and error handling. Each job executed on the cluster is recorded and monitored. Common problems and error messages are defined in this section. Based on this information, strategies have been developed and stored in the OTP database to manage failing jobs at any time.
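As a rough illustration of the first section, the sequencing metadata maps naturally onto Grails/GORM domain classes (see section 7). The classes below are a hypothetical, heavily simplified sketch; OTP's real schema is far more elaborate.

```groovy
// Hypothetical, simplified GORM-style domain classes for the first database
// section (basic sequencing metadata); OTP's real schema differs.
class Project {
    String name
    String unixGroup        // used to enforce file-system permissions (section two)
}

class Sample {
    String patientPseudonym // pseudonymized patient identifier
    String sampleType       // e.g. "tumor" or "control"
    static belongsTo = [project: Project]
}

class SeqTrack {            // one lane of sequencing data
    String runId
    Integer laneNumber
    String sequencingType   // e.g. WGS, EXON, RNA
    String libraryLayout    // e.g. PAIRED
    static belongsTo = [sample: Sample]
}
```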

| Field | Explanation | Example |
|---|---|---|
| FASTQ_FILE | Name of raw NGS data file (FASTQ) | xxx_ATCACG_L008_R1.fastq.gz |
| MD5 | Md5sum of data file | b6e70098dfefe7d08cc6b1c376e84f72 |
| CENTER_NAME | Acronym of sequencing center | MDC |
| RUN_ID | ID of sequencing run | 130613_SN700xx_0101_AC16PKACX |
| RUN_DATE | Date of run | 2013-06-13 |
| LANE_NO | Lane number | 8 |
| BARCODE | In case of multiplexing | ATCACG |
| MATE | Mate number (1 or 2) | 1 |
| PROJECT | Name of the project | project_ABX |
| PATIENT_ID | Pseudonymized identifier of the patient | ABX_01 |
| BIOMATERIAL_ID | Type of the sample | tumor |
| SAMPLE_ID | ID of biosample | ABX_01_tu |
| SEQUENCING_TYPE | Sequencing technology | EXON |
| LIBRARY_LAYOUT | Library layout | PAIRED |
| INSTRUMENT_PLATFORM | Type of sequencing machine | Illumina |
| INSTRUMENT_MODEL | Model of sequencing machine | HiSeq2000 |
| PIPELINE_VERSION | Basecalling algorithm/pipeline | bcl2fastq_1.8.4 |
| INSERT_SIZE | Average size of library fragments in bp | 300 |
| SEQUENCING_KIT | Version of sequencing chemistry | V2 |
| LIB_PREP_KIT | Chemistry used for library preparation | Agilent SureSelect V4 |

Tab. 2: Example of a metadata file imported to OTP. Parameters regarding both provenance (center_name, sample_id) and technical handling (pipeline_version, lib_prep_kit) are requested for long-term archiving in the database.

4. Automation
To be able to process data automatically, OTP requires processing information stored in the appropriate section of the database. The processing information is provided to OTP when a new project is set up. During data import, OTP checks the database for how the processing should be done for the project and creates processing objects. The PrObs serve as messages between the different processing steps and carry state information about the processing status. Currently, six different types of steps are implemented: “start step”, “execute step”, “validation step”, “parse QC step”, “transfer results step” and “notification step”. The number and order of processing steps, whose sequence is called a “workflow” in OTP, can be flexibly defined through a domain-specific language (DSL), as sketched below. Each OTP workflow (Fig 4) has its own implementation of the different steps, except for the “notification step”, which is shared between all workflows. Following the data import, OTP automatically executes the start steps to check the database every few minutes for PrObs that require processing. If a PrOb with an appropriate flag and a corresponding configuration is found, the PrOb is passed to the execute step and is logged in the database. The execute step collects all information needed for processing the data and invokes Roddy, which in turn submits all compute-intensive jobs to the cluster. OTP queries the cluster on a regular basis to check whether all cluster jobs belonging to a specific processing object are finished. After all jobs belonging to one “execute step” are completed, the “validation step” checks whether all necessary output files were created. In case of a successful validation, the quality control results are parsed and stored in the database in the “parse QC step”. In the “transfer results step” the result files are rendered visible and the correct access permissions are set. The last step is always the “notification step”, which sends a notification email to the customers and to the service/ticket management system OTRS. When all steps within one OTP workflow are processed, the PrOb is marked as finished and follow-up processing objects for further analyses, like SNV calling, are created. In addition to the processing, OTP is able to handle many errors automatically. For this purpose, an error handler in OTP parses the error messages and log files and compares them to error descriptions stored in the database. In case a description matches, OTP knows how to handle the error. Currently three strategies are implemented: restarting the failed job, restarting the complete workflow, or sending an email containing exact information about the failed job, the error description and the location of the log file to the persons responsible for OTP. Further error resolution approaches, like parameter adaptations or the usage of other tools for the same purpose, are planned.
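To give an impression of how such step chaining might look, the following runnable Groovy sketch builds a workflow definition from six named steps; the DSL syntax and class names are invented for illustration and are not OTP's actual DSL.

```groovy
// Invented sketch of defining a workflow as an ordered list of named steps;
// OTP's actual DSL differs in syntax and capabilities.
class WorkflowDefinition {
    String name
    List<String> steps = []

    static WorkflowDefinition define(String name,
                                     @DelegatesTo(WorkflowDefinition) Closure body) {
        WorkflowDefinition wf = new WorkflowDefinition(name: name)
        body.delegate = wf
        body()
        return wf
    }

    // every otherwise-unknown method call inside the closure becomes a step
    def methodMissing(String stepName, Object args) { steps << stepName }
}

def snvWorkflow = WorkflowDefinition.define('SnvCallingWorkflow') {
    startStep()           // poll the database for PrObs to process
    executeStep()         // collect inputs, invoke Roddy, submit cluster jobs
    validationStep()      // check that all expected output files exist
    parseQcStep()         // store QC values in the database
    transferResultsStep() // make results visible, set access permissions
    notificationStep()    // shared by all workflows: email customer and OTRS
}
assert snvWorkflow.steps.size() == 6
```

In this picture, the error-handling strategies named above (restart the job, restart the workflow, or notify) would hook into the failure path of the execute step.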

5. User interface
The OTP server (otp.dkfz.de) provides a graphical user interface (GUI) with dedicated views for OTP's various user types. The external users are project coordinators, principal investigators and bioinformaticians. These users want to get an overview of the project, track the progression of analyses, and oversee the quality of the samples in the project (Fig 5). In particular, the bioinformaticians can learn details about sample quality and identify specific problems with several plots and summary statistics originating from the different pipelines and presented in the GUI (Fig 6). For the alignment, QC thresholds are defined, and visual highlighting indicates whether the quality metrics pass the minimum requirements for further processing. Furthermore, for follow-up analyses and interpretations, the bioinformaticians can get detailed metadata for all processing steps and the locations of all input and output files on the file system. This first large group of external users is complemented by the group of administrative users, who are responsible for the operation of OTP itself – the maintainers and operators. Maintainers supervise data processing via administration pages that show more details about the processing steps and cluster jobs than are available to the external users. Problems are highlighted in detail, which allows the maintainers to quickly identify and handle unexpected errors. Maintainers are authorized to load new error messages into the automatic error handler and to define the desired remediation strategies. In contrast to maintainers, data operators are more involved in the communication necessary to operate a multi-user and multi-project data management system. Several web forms assist them when importing data manually into OTP. In particular, a validator checks all metadata before importing them into OTP. If any information does not pass validation, details are presented to the operator about what needs to be changed to pass the validation successfully. Several graphical templates are available to configure the different pipeline parameters. The user management and access control, crucial for data protection and security, is also organized via the GUI by authorized personnel. To make sure only authorized users are able to access the data, LDAP is connected for authentication. Furthermore, the user management in OTP is implemented to identify which projects a user is authorized to access and which actions they are allowed to perform, e.g. administration.

6. Statistics
The implementation of performant bioinformatic pipelines on a large compute cluster is a complex undertaking. To identify bottlenecks and needs for improvement, statistical data based on the information logged in the OTP database during processing is used. Data is collected to track the averages of queuing delays, processing times, memory and CPU usage, and the error frequency for each cluster job (Fig 7). These statistics are used to request the appropriate resources on the cluster and to optimize its usage. Furthermore, steps can be identified that take disproportionately long or need too many resources. For instance, some tumor samples with an extremely high rate of structural reorganization suffer from prolonged processing times during alignment and structural variation calling. This is a problem because their cluster jobs may require much more time than what is conventionally requested from the scheduler. Other examples are occasionally occurring low-quality samples and samples consisting mainly of adapter sequences, which have extremely short processing times. The identification of problems of this sort can assist workflow and quality control improvements. Furthermore, long-term statistics allow reconstructing whether changes in a pipeline implementation led to robust improvements of the processing, like acceleration or reduced memory usage. These statistics are visualized in the GUI and are easy to access and interpret.
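As an example, the per-job averages shown in Fig 7 amount to a simple aggregation over the logged cluster jobs. The HQL query below is a hypothetical sketch assuming a Grails/GORM domain class; the ClusterJob class and its properties are invented for illustration.

```groovy
// Hypothetical sketch, assuming a Grails/GORM domain class ClusterJob whose
// properties (jobName, queueingDelaySeconds, ...) are invented for illustration.
Map jobStatistics(String jobName) {
    List row = ClusterJob.executeQuery(
            '''select avg(j.queueingDelaySeconds), avg(j.processingTimeSeconds),
                      avg(j.memoryUsedMb), avg(j.cpuTimeSeconds)
               from ClusterJob j
               where j.jobName = :name''',
            [name: jobName])[0] as List
    [avgQueueingDelaySeconds : row[0],
     avgProcessingTimeSeconds: row[1],
     avgMemoryUsedMb         : row[2],
     avgCpuTimeSeconds       : row[3]]
}
```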

7. Development framework and process
OTP is written in Groovy [27], a language running on the Java Virtual Machine [28]. This allows the use of the wealth of existing, robust libraries in the Java ecosystem, while Groovy language features like closures, dynamic typing and the meta-object protocol enable rapid development. The OTP development is further accelerated by using the Grails framework [29] for the web-facing side, since it provides simplified access to various libraries like Groovy Server Pages. The relational database management system PostgreSQL [30] is used as our persistence layer, in combination with Grails' built-in object-relational mapper (GORM). Continuous integration and smooth releases are assisted by the continuous integration system Jenkins [31]. To improve the GUI and to determine which parts are visited more often than others, the open-source analytics tool Piwik [32] is used, which complies with the strict German data privacy laws. To reflect the constantly changing requirements in the scientific field, agile practices are applied in daily development. To organize the development process, the Scrum model is used. This includes sprints, backlog grooming, sprint planning and daily stand-ups. In particular, the daily stand-ups turned out to be important for the effective communication between the programmers and the bioinformaticians directly involved in the integration of new workflows into OTP. Weekly releases enable the customers to use the latest improvements and to provide fast feedback to the developers. Unit and integration tests are implemented to test complete workflows in OTP.

8. IT infrastructure
Because bioinformatics analyses are highly complex, the requirements on the IT infrastructure are diverse and challenging. One of these challenges is the use of various libraries and open-source tools in different versions, which makes it very complex to organize and keep track of them simultaneously. Another challenge is to prevent data loss and to keep data accessible for a long time, in particular when dealing with patient data. For safety reasons, relevant data is mirrored on spinning disks. Snapshots for a time period of several weeks require additional disk space overhead. Bioinformaticians need disk space for further downstream analyses. Guided by experience, and in order not to limit the scientists, the required disk space for analyses is on the same scale as that for the raw data. In total, the disk space required is four times the size of the raw data when taking mirroring into account. As most analysis pipelines are very compute-intensive, a performant, reliable and resilient cluster infrastructure with 24 h availability is required. Breaking down the processes into small pieces, as explained in chapter 4 “Automation”, helps to control the entire process efficiently and to get an overview of the requested hardware resources. Based on this information, the additionally required volume and range of network bandwidth, disk space and computing nodes can be estimated and optimized. Statistical values are collected within OTP and used for the calculation of the required computing resources such as CPUs and memory. The network bandwidth is estimated based on the number of large files, like BAM and FASTQ files, being read and written. The usage of such huge files raises new challenges, like a sufficient size and performance of the storage system. The IT resources for routine analyses, comprising alignment and variant calling, are roughly estimated as follows: assuming 10,000 WGS lanes per year with a 30x coverage per lane and an 80 GB BAM file size are processed in OTP, roughly 600 cores, 5 TB RAM, 3.2 petabytes of disk space and a network bandwidth capability of about 1.2 Gbit/s are required. The complete IT infrastructure needs to be adapted permanently to cope with the increasing amount of NGS data.
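These figures can be cross-checked with a back-of-the-envelope calculation; the following is our reading of the stated assumptions, not an exact sizing:

\[
10{,}000\ \tfrac{\text{lanes}}{\text{year}} \times 80\ \tfrac{\text{GB}}{\text{lane}} \approx 0.8\ \tfrac{\text{PB}}{\text{year}}\ \text{(raw data)}, \qquad 4 \times 0.8\ \text{PB} = 3.2\ \text{PB (disk, incl. mirroring and analysis space)},
\]
\[
1.2\ \text{Gbit/s} \approx 0.15\ \text{GB/s} \approx 4.7\ \tfrac{\text{PB}}{\text{year}}\ \text{of sustained I/O},
\]

i.e. the network is dimensioned such that roughly the yearly raw volume can traverse it several times during alignment and variant calling.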

9. Comparison with other platforms
There are various NGS data management and analysis platforms available, each with its specific features and capabilities. In this section, several popular platforms will be discussed and compared. HTS-Flow [33], developed by the Istituto Italiano di Tecnologia in Milan, interacts with their local LIMS and automates NGS analysis in a traceable manner through a simple GUI. The WASP System [34] coordinates sample submission and subsequent processing. A unique focus is their open-source roll-out strategy, actively inviting other institutes to collaborate. In this comparison, only the processing part of the WASP System is discussed. QuickNGS [35] includes a wide variety of tools to accelerate the most-used operations on NGS data. Users can trigger these via a web-based GUI. Chipster [36] allows the creation and saving of workflows in a desktop program that interfaces with a local server. It focuses on integrating many tools and external datasets. Omics Pipe [37] provides a Python package that integrates many tools and allows batch processing for IT-savvy users. Tab. 3 shows a comparison of the platforms regarding criteria essential for the management and processing of genomics data.


| Criterion | OTP | HTS-Flow | WASP System | QuickNGS | Chipster | Omics Pipe |
|---|---|---|---|---|---|---|
| Data type | WGS, WGBS, WGBS-Tag., Exome, RNA, ChIP-seq | WGS, WGBS, RNA, ChIP-seq | Exome, RNA, ChIP-seq, miRNA, HELP-tagging | WGS, RNA, ChIP-seq, miRNA | RNA, ChIP-seq, miRNA, methyl-Seq, microarray data | WGS, Exome, RNA, ChIP-seq, miRNA |
| Meta data information | | | | | | |
| GUI | | | | | | |
| Access control | | | | | | |
| Automation | | | | | | |
| Reproducibility of analyses | | | | | | |
| Flexibility | | | | | | |
| Availability | as service* | open source | open source | open source | open source | open source |
| Data provision | file system & download | R data objects | | download | file system | file system |
| Version control | | | | | | |
| Programming languages | Groovy, Grails, Bash | R BatchJob library, php | Spring Java, J2EE | Bash, Perl, R | Java | Python |
| Latest development | 25.07.2017 | 13.07.2016 | | 10.11.2016 | 03.07.2017 | 12.07.2017 |
| Institution | DKFZ – Theoretical Bioinformatics | Fondazione Istituto Italiano di Tecnologia | Albert Einstein College of Medicine | Bioinformatics Core Facility, CECAD Cologne | CSC-IT Center for Science | The Scripps Research Institute |

Tab. 3: Comparison of the platforms regarding criteria essential for the management and processing of genomics data; *in preparation for open source.

9.1. Metadata information stored in the database

Not only the information on how to process the data and the implementation of the processing pipelines are important, but also keeping track of the origin, the preparation and the properties of the raw data, such as information about the sample, the corresponding project, the library preparation or the sequencing run. This information is a prerequisite for providing an overview of the project data and of how to process it. For example, HTS-Flow automatically imports all information crucial for the data analysis while downloading and converting the raw files from different repositories. This functionality enables HTS-Flow to guarantee that data provided by different sources is processed uniformly. To run its analyses, QuickNGS also imports several pieces of metadata, like the library type and batch information. Moreover, information about data provenance and sample preparation is stored. Like QuickNGS, OTP imports both the metadata required for data processing and the data origin. Further information about the sequencing process, like the cycle count and the corresponding parameters and QC values, as well as the quality of the sample preparation, is included in OTP.

9.2. Graphical User Interface
There is a wide variety in the functionality and focus of the GUIs of the different platforms. Chipster’s user interface allows the user to get an overview of the included analysis tools, imported datasets and workflows, and provides a visualization of the result plots. Besides the visualization of result data, QuickNGS focuses on monitoring the status of processing and provides a visualized overview of the metadata. HTS-Flow enables the bioinformatician to run analyses via the GUI. Like Chipster and QuickNGS, OTP's GUI shows metadata and provides an overview of the processing status, data quality and results. Following OTP's function of automated processing of data by standardized pipelines as a service, OTP does not allow bioinformatic users to manually trigger or parameterize analyses via the GUI, like HTS-Flow does. Instead, analyses are usually triggered automatically, and parameters are set during project setup uniformly for all samples. Consequently, OTP has a strong metadata validation feature accessible through the GUI during manual imports and allows the data operators to configure workflows via dedicated views. Also, details about the data processing, such as cluster job statuses and error recovery strategies, are kept away from bioinformatic users, but are instead accessible through the GUI to maintainers. The GUI eases the management of users by operators and maintainers, as can be expected for a multi-user, multi-project service platform.

9.3. Access control
There are two different levels of data access: direct access to the raw and result files on the file system, and access to the result data provided via the user interface. In OTP, the access control for these two levels is separated. This has the advantage that scientists who are not able or authorized to access the file system or the raw data are nevertheless able to enter the GUI and inspect the results. The publications for the other systems in this comparison do not provide information about access control on the file system level, but it can be assumed that all of the compared tools use an authentication infrastructure. The WASP System should be highlighted in this context, since it is strong in data security. It uses an authentication and authorization infrastructure to control data access. Furthermore, it enables the PIs to grant read-only access to researchers on a case-by-case basis.

9.4. Automation and Flexibility
The level of automation provided by the various systems ranges from the manual combination of different analyses into one ‘workflow’ that can then run through without further manual interaction, at the one end, to fully automated processing at the other end. An example of a platform with less automation is Chipster. Here, it is possible to combine different analyses into one workflow, which can then be executed with one click. This workflow can also be stored for reuse and shared with other researchers. The WASP System offers a high level of automation, since the complete process from importing data via processing to notification is automated. OTP covers all these automation features of the WASP System and enriches them with automated error handling (see chapter 4 “Automation”). Among the investigated systems there seems to be a trade-off between automation and flexibility. Thus, Chipster, with its low level of automation, provides high flexibility and enables the user to experiment with different methods and parameters. By contrast, the WASP System and OTP show less flexibility in adjusting parameters but provide a high level of automation. Flexibility can also be examined from another perspective: the fast integration of new pipelines. Regarding the integration of new functionality, both the WASP System and OTP are very efficient due to their modularity.

9.5. Reproducibility

An important consideration during the processing of NGS data, especially for big projects or inter-project comparisons, is the reproducibility of the results. It is advisable to log how the data was created and which tools and parameters were used. Tracking of the process can be supported by a high degree of automation, so that the framework logs the information in a unified way during processing. Logging is implemented at different levels in the considered tools. It ranges from “the input and annotation files and the stored workflows are enough to reproduce the analysis” (Chipster) to “each execution command and parameter is logged in a database or on the file system” (Omics Pipe). Like Omics Pipe, OTP also tracks processing information and tool versions in the database to ensure reproducibility.

9.6. Availability
Except for OTP, which is in preparation for open source, the source code of all compared platforms is already available to the scientific community. Some of them even invite the community to implement new features and contribute to their success (WASP). Others only provide the source code of the system so that it can be installed by external researchers. The purpose of OTP was to offer data management and processing as a service rather than to provide the software itself. This encourages scientists to focus on downstream analyses and the development of new tools rather than executing basic analyses. In the context of the German Network for Bioinformatics Infrastructure (de.NBI), OTP is offered to the German bioinformatics community as a service.

10. Conclusion

Taken together, there are several platforms available that handle data management and processing. All of them have different sets of features and capabilities and are designed with different use cases in mind. The main use case for which OTP has been developed since 2012 is the completely automatic processing of mostly human sequencing data at one of the largest sequencing institutions in Europe and in the context of large-scale research cooperation projects. Processing is done by workflows that implement standard NGS analyses, such as alignment and quality control, as well as state-of-the-art analyses. The main goal was to reduce the workload of manifold repeated analyses for specialized researchers and thus free their time for their core research activities. The design decisions implemented in OTP, such as the database that contains meta- and provenance data, the GUI that provides a quick overview of the processing status and data quality, or the integration of the workflow management component Roddy that allows the quick adoption of research pipelines by OTP, all support this main goal. Research and clinical requirements are under permanent development, and OTP will continue to be adapted to upcoming demands. The GUI will be extended to include additional features, like the extension of thresholds for QC color coding to all processing steps, or the possibility to group the results of different projects for inter-project comparisons. A further development will be an extended QC monitoring that surveys whether samples reach defined quality thresholds and, if applicable, automatically informs the researcher about low-quality samples while marking the files on the file system as affected. New analyses developed by the in-house bioinformaticians will be included, like alternative pipelines for variation calling. To increase the flexibility and the number of analyses, the integration of other job execution frameworks is planned. To make OTP available to other data centers, all dependencies on the DKFZ infrastructure are currently being removed. As soon as this process is completed, the software will be provided open source.

11. Acknowledgments

The authors acknowledge the entire OTP team and all its members, past and present, for making OTP into what it is today. We would like to thank the Computational Oncology Group and the Applied Bioinformatics Group of the DKFZ for the development of the analysis pipelines and the job execution framework Roddy. Special thanks to Matthias Schlesner, Michael Heinold, Florian Kärcher, Barbara Hutter, Ivo Buchhalter, Natalie Jäger, Kortine Kleinheinz, Naveed Ishaque, Umut Toprak, Matthias Bieg and Nagarajan Paramasivam. We also thank Julia Ritzerfeld and Liam Childs for valuable suggestions on the manuscript. The development of OTP was partially funded via grants by the German Federal Ministry of Education and Research (BMBF) projects LungSys (0315415A), ICGC PedBrainTumor (01KU1201A), CancerSys LungSysII (0316042A), CancerSys MYC-NET (0316076B), ICGC: The Genomes of Early Onset Prostate Cancers (01KU1001A), ICGC Malignant Lymphoma (01KU1002B), e:BIO: ImmunoQuant (0316170A), e:MED – PANC-STRAT (01ZX1305A), Deutsches Epigenom-Projekt DEEP (01KU1216B). Additional support came from the German Cancer Aid ICGC PedBrainTumor (109252), from the Initiative and Networking Fund of the Helmholtz Association within the Helmholtz Alliance on Systems Biology and from the Heidelberg Center for Personalized Oncology (DKFZ-HIPO). The distribution of OTP is supported by the BMBF-funded Heidelberg Center for Human Bioinformatics (HD-HuB) within the German Network for Bioinformatics Infrastructure (de.NBI) (#031A537A, #031A537C).

12. References
[1] nct-heidelberg.de, (n.d.). https://www.nct-heidelberg.de/en.html (accessed February 15, 2017).
[2] hipo-heidelberg.de, (n.d.). http://www.hipo-heidelberg.org/hipo2/ (accessed February 15, 2017).
[3] dktk.dkfz.de, (n.d.). https://dktk.dkfz.de/en/home (accessed February 16, 2017).
[4] icgc.org, (n.d.). http://icgc.org/ (accessed February 16, 2017).
[5] The Roddy workflow development and management system, eilslabs, 2016. https://github.com/eilslabs/Roddy (accessed July 28, 2017).
[6] otrs.com – Simple service management, Otrs.Com. (n.d.). https://www.otrs.com/ (accessed February 15, 2017).
[7] FastQC: A Quality Control tool for High Throughput Sequence Data, (n.d.). http://www.bioinformatics.babraham.ac.uk/projects/fastqc/ (accessed February 15, 2017).
[8] D.T.W. Jones, N. Jäger, M. Kool, T. Zichner, B. Hutter, M. Sultan, Y.-J. Cho, T.J. Pugh, V. Hovestadt, A.M. Stütz, T. Rausch, H.-J. Warnatz, M. Ryzhova, S. Bender, D. Sturm, S. Pleier, H. Cin, E. Pfaff, L. Sieber, A. Wittmann, M. Remke, H. Witt, S. Hutter, T. Tzaridis, J. Weischenfeldt, B. Raeder, M. Avci, V. Amstislavskiy, M. Zapatka, U.D. Weber, Q. Wang, B. Lasitschka, C.C. Bartholomae, M. Schmidt, C. von Kalle, V. Ast, C. Lawerenz, J. Eils, R. Kabbe, V. Benes, P. van Sluis, J. Koster, R. Volckmann, D. Shih, M.J. Betts, R.B. Russell, S. Coco, G. Paolo Tonini, U. Schüller, V. Hans, N. Graf, Y.-J. Kim, C. Monoranu, W. Roggendorf, A. Unterberg, C. Herold-Mende, T. Milde, A.E. Kulozik, A. von Deimling, O. Witt, E. Maass, J. Rössler, M. Ebinger, M.U. Schuhmann, M.C. Frühwald, M. Hasselblatt, N. Jabado, S. Rutkowski, A.O. von Bueren, D. Williamson, S.C. Clifford, M.G. McCabe, V. Peter Collins, S. Wolf, S. Wiemann, H. Lehrach, B. Brors, W. Scheurlen, J. Felsberg, G. Reifenberger, P.A. Northcott, M.D. Taylor, M. Meyerson, S.L. Pomeroy, M.-L. Yaspo, J.O. Korbel, A. Korshunov, R. Eils, S.M. Pfister, P. Lichter, Dissecting the genomic complexity underlying medulloblastoma, Nature 488 (2012) 100–105. doi:10.1038/nature11284.
[9] D.T.W. Jones, B. Hutter, N. Jäger, A. Korshunov, M. Kool, H.-J. Warnatz, T. Zichner, S.R. Lambert, M. Ryzhova, D.A.K. Quang, A.M. Fontebasso, A.M. Stütz, S. Hutter, M. Zuckermann, D. Sturm, J. Gronych, B. Lasitschka, S. Schmidt, H. Şeker-Cin, H. Witt, M. Sultan, M. Ralser, P.A. Northcott, V. Hovestadt, S. Bender, E. Pfaff, S. Stark, D. Faury, J. Schwartzentruber, J. Majewski, U.D. Weber, M. Zapatka, B. Raeder, M. Schlesner, C.L. Worth, C.C. Bartholomae, C. von Kalle, C.D. Imbusch, S. Radomski, C. Lawerenz, P. van Sluis, J. Koster, R. Volckmann, R. Versteeg, H. Lehrach, C. Monoranu, B. Winkler, A. Unterberg, C. Herold-Mende, T. Milde, A.E. Kulozik, M. Ebinger, M.U. Schuhmann, Y.-J. Cho, S.L. Pomeroy, A. von Deimling, O. Witt, M.D. Taylor, S. Wolf, M.A. Karajannis, C.G. Eberhart, W. Scheurlen, M. Hasselblatt, K.L. Ligon, M.W. Kieran, J.O. Korbel, M.-L. Yaspo, B. Brors, J. Felsberg, G. Reifenberger, V.P. Collins, N. Jabado, R. Eils, P. Lichter, S.M. Pfister, Recurrent somatic alterations of FGFR1 and NTRK2 in pilocytic astrocytoma, Nat. Genet. 45 (2013) 927–932. doi:10.1038/ng.2682.
[10] A. Rimmer, H. Phan, I. Mathieson, Z. Iqbal, S.R.F. Twigg, WGS500 Consortium, A.O.M. Wilkie, G. McVean, G. Lunter, Integrating mapping-, assembly- and haplotype-based approaches for calling variants in clinical sequencing applications, Nat. Genet. 46 (2014) 912–918. doi:10.1038/ng.3036.
[11] U. Toprak, R. Eils, M. Schlesner, SOPHIA workflow (manuscript in preparation).
[12] K. Kleinheinz, R. Eils, M. Schlesner, ACEseq workflow (manuscript in preparation).
[13] H. Li, Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM, arXiv:1303.3997 [q-bio] (2013). http://arxiv.org/abs/1303.3997 (accessed February 15, 2017).
[14] H. Li, R. Durbin, Fast and accurate long-read alignment with Burrows-Wheeler transform, Bioinformatics 26 (2010) 589–595. doi:10.1093/bioinformatics/btp698.
[15] A. Dobin, C.A. Davis, F. Schlesinger, J. Drenkow, C. Zaleski, S. Jha, P. Batut, M. Chaisson, T.R. Gingeras, STAR: ultrafast universal RNA-seq aligner, Bioinformatics 29 (2013) 15–21. doi:10.1093/bioinformatics/bts635.
[16] Picard Tools – By Broad Institute, (n.d.). http://broadinstitute.github.io/picard/ (accessed February 15, 2017).
[17] G. Tischler, S. Leonard, biobambam: tools for read pair collation based algorithms on BAM files, Source Code Biol. Med. 9 (2014) 13. doi:10.1186/1751-0473-9-13.
[18] A. Tarasov, A.J. Vilella, E. Cuppen, I.J. Nijman, P. Prins, Sambamba: fast processing of NGS alignment formats, Bioinformatics 31 (2015) 2032–2034. doi:10.1093/bioinformatics/btv098.
[19] V. Hovestadt, methylCtools (unpublished).
[20] H. Li, B. Handsaker, A. Wysoker, T. Fennell, J. Ruan, N. Homer, G. Marth, G. Abecasis, R. Durbin, The Sequence Alignment/Map format and SAMtools, Bioinformatics 25 (2009) 2078–2079. doi:10.1093/bioinformatics/btp352.
[21] D.S. DeLuca, J.Z. Levin, A. Sivachenko, T. Fennell, M.-D. Nazaire, C. Williams, M. Reich, W. Winckler, G. Getz, RNA-SeQC: RNA-seq metrics for quality control and process optimization, Bioinformatics 28 (2012) 1530–1532. doi:10.1093/bioinformatics/bts196.
[22] K. Okonechnikov, A. Conesa, F. García-Alcalde, Qualimap 2: advanced multi-sample quality control for high-throughput sequencing data, Bioinformatics 32 (2016) 292–294. doi:10.1093/bioinformatics/btv566.
[23] BatchEuphoria: A library to access different kinds of cluster backends, eilslabs, 2017. https://github.com/eilslabs/BatchEuphoria (accessed August 3, 2017).
[24] J. Köster, S. Rahmann, Snakemake—a scalable bioinformatics workflow engine, Bioinformatics 28 (2012) 2520–2522. doi:10.1093/bioinformatics/bts480.
[25] Galaxy Community Hub, (n.d.). https://www.galaxyproject.org/ (accessed August 3, 2017).
[26] J. Boekel, J.M. Chilton, I.R. Cooke, P.L. Horvatovich, P.D. Jagtap, L. Käll, J. Lehtiö, P. Lukasse, P.D. Moerland, T.J. Griffin, Multi-omic data analysis using Galaxy, Nat. Biotechnol. 33 (2015) 137–139. doi:10.1038/nbt.3134.
[27] Groovy programming language, (n.d.). http://www.groovy-lang.org/ (accessed February 15, 2017).
[28] Java Virtual Machine Specification, (n.d.). https://docs.oracle.com/javase/specs/jvms/se8/html/ (accessed February 15, 2017).
[29] Grails Framework, (n.d.). https://www.grails.org/ (accessed February 15, 2017).
[30] PostgreSQL.org, (n.d.). https://www.postgresql.org/ (accessed February 15, 2017).
[31] Jenkins, (n.d.). https://jenkins.io/ (accessed February 15, 2017).
[32] piwik.org, Piwik Analytics Platform. (n.d.). http://piwik.org/ (accessed February 15, 2017).
[33] V. Bianchi, A. Ceol, A.G.E. Ogier, S. de Pretis, E. Galeota, K. Kishore, P. Bora, O. Croci, S. Campaner, B. Amati, M.J. Morelli, M. Pelizzola, Integrated Systems for NGS Data Management and Analysis: Open Issues and Available Solutions, Front. Genet. 7 (2016). doi:10.3389/fgene.2016.00075.
[34] A.S. McLellan, R.A. Dubin, Q. Jing, P.Ó. Broin, D. Moskowitz, M. Suzuki, R.B. Calder, J. Hargitai, A. Golden, J.M. Greally, The Wasp System: an open source environment for managing and analyzing genomic data, Genomics 100 (2012) 345–351. doi:10.1016/j.ygeno.2012.08.005.


[35] P. Wagle, M. Nikolić, P. Frommolt, QuickNGS elevates Next-Generation Sequencing data analysis to a new level of automation, BMC Genomics 16 (2015) 487. doi:10.1186/s12864-015-1695-x.
[36] M.A. Kallio, J.T. Tuimala, T. Hupponen, P. Klemelä, M. Gentile, I. Scheinin, M. Koski, J. Käki, E.I. Korpelainen, Chipster: user-friendly analysis software for microarray and other high-throughput data, BMC Genomics 12 (2011) 507. doi:10.1186/1471-2164-12-507.
[37] K.M. Fisch, T. Meißner, L. Gioia, J.-C. Ducom, T.M. Carland, S. Loguercio, A.I. Su, Omics Pipe: a community-based framework for reproducible multi-omics data analysis, Bioinformatics 31 (2015) 1724–1728. doi:10.1093/bioinformatics/btv061.

Figure legends

Fig 1: Overview of all samples imported into OTP within the last five years. The blue line represents all samples loaded into OTP; each of the other lines represents one sequencing type. The graphic shows that the import of exome and RNA data was quite constant, while the import of whole genome sequencing (WGS) data increased. This can be explained by the fact that since late 2015 an increasing number of samples have been produced by the Illumina HiSeq X Ten sequencers at the DKFZ. The reason for the rise of imported whole genome bisulfite (WGBS) samples is that a methylation calling pipeline has been offered since 2016. The processing of ChIP-seq data was implemented recently.

Fig 2: Overview of data processing in OTP. An email with a specific structure is sent to the service management system OTRS to trigger the automatic import of the data into OTP. Afterwards, further analyses, such as alignment or variant calling, are initiated within OTP. For processing, jobs are submitted to the cluster. The log and result files are written to the data storage system, and quality control values and execution information are additionally stored in the database, which allows displaying them to the user via OTP’s graphical user interface (GUI). After each analysis, the users are notified automatically via email.

Fig 3: Interactions between the different components integrated in OTP, needed for one analysis. (1) OTP collects all required information from the database, (2) provides the collected information to Roddy and triggers the analysis. (3) Roddy starts the analysis and submits jobs to the cluster via BatchEuphoria (4). (3) The job identifiers are returned from the cluster to Roddy, and (2) Roddy provides the job identifiers to OTP. (5) OTP queries the status of the jobs on the cluster, again via BatchEuphoria (4). The checking of the job statuses is continued until all jobs are finished.

Fig 4: Simplified overview of a workflow. Processing is divided into six subsequent steps. The “start step” continuously monitors the database for new data to process and creates processing objects (PrObs). The “execute step” executes the instructions from the PrObs, delegating any compute-intensive tasks to the HTC cluster via Roddy. The “validation step” validates all result files. In case of a successful validation, the QC values are stored in the database in the “parse QC step” and the results are moved to their final location in the “transfer results step”. Finally, the “notification step” informs the customer via email.

Fig 5: Overview of alignment QC values for one project. The most important QC values available after alignment are shown, including coverage, percentage of mapped reads and duplication rate. Threshold warnings are highlighted in red or orange. Patient pseudonyms in the first column and the project names in the search box are intentionally omitted here and in the following figures.

Fig 6: Overview of the SNV calling analysis for one project. Sample combinations and their current processing states are listed, with links to their result plots for inspection.

Fig 7: Example of a job statistics overview. The overview shows the averages of CPU usage, processing time and memory used, in this case for the SNV calling job.
