Bioinformatic strategies to address limitations of 16rRNA short-read amplicons from different sequencing platforms

Bioinformatic strategies to address limitations of 16rRNA short-read amplicons from different sequencing platforms

Journal Pre-proof Bioinformatic strategies to address limitations of 16rRNA shortread amplicons from different sequencing platforms María Bailén, Car...

1MB Sizes 0 Downloads 15 Views

Journal Pre-proof Bioinformatic strategies to address limitations of 16rRNA shortread amplicons from different sequencing platforms

María Bailén, Carlo Bressa, Mar Larrosa, Rocío González-Soltero PII:

S0167-7012(19)30784-5

DOI:

https://doi.org/10.1016/j.mimet.2019.105811

Reference:

MIMET 105811

To appear in:

Journal of Microbiological Methods

Received date:

9 September 2019

Revised date:

14 December 2019

Accepted date:

15 December 2019

Please cite this article as: M. Bailén, C. Bressa, M. Larrosa, et al., Bioinformatic strategies to address limitations of 16rRNA short-read amplicons from different sequencing platforms, Journal of Microbiological Methods (2019), https://doi.org/10.1016/ j.mimet.2019.105811

This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

© 2019 Published by Elsevier.

Journal Pre-proof Bioinformatic strategies to address limitations of 16rRNA short-read amplicons from different sequencing platforms María Bailéna, Carlo Bressaa, Mar Larrosaa, Rocío González-Solteroa# a

Universidad Europea de Madrid, Faculty of Biomedical and Health Sciences,

-p

Corresponding author: Rocío González Soltero

ro

of

Nutrition, Microbiota and Health group, Villaviciosa de Odón, Madrid 28670, Spain.

re

[email protected]

lP

Keywords: mock, NGS platforms, metagenomic analysis pipelines, bacterial

Jo

ur

na

community, MiSeq, PGM.

Journal Pre-proof ABSTRACT: Sequencing the 16S gene rRNA has become a popular method when identifying bacterial communities. However, recent studies address differences in the characterization based on sample preparation, sequencing platforms, and data analysis. In this work, we tested some of the available user-friendly protocols for data analysis with the reads obtained from the sequencing machines Illumina MiSeq and Ion Torrent Personal Genome Machine (PGM). We sought for the advantages and disadvantages

of

of both platforms in terms of accuracy, detected species, and abundance, analyzing a

ro

staggered mock community. Four different pipelines were applied: QIIME 1.9.1 with

-p

default parameters, QIIME 1.9.1 with modified parameters and chimera removal,

re

VSEARCH 2.3.4, and QIIME 2 v.2018.2. To address the limitations of species level detection we used species-classifier SPINGO. The optimal pipeline for PGM platform,

lP

was the use of QIIME 1.9.1 with default parameters (QIIME1), except when a study

na

requires the detection of Bacteroides or other Bacteroidaceae members, in which QIIME1MOD (with chimera removal) seems to be a good alternative. For Illumina

ur

Miseq, VSEARCH strategy can be a good option. Our results also confirm that all the

Jo

tested pipelines can be used for metagenomic analysis at family and genus level.

Journal Pre-proof 1. Introduction 1.1. Limitations of 16S rRNA short-read amplicons sequencing Sequencing of the 16S rRNA gene is the most popular method for characterizing the complexity

of microbial communities.

The

latest

advances in high-throughput

sequencing techniques and next-generation (NGS) sequencing platforms allow users to assign taxonomy considering the hypervariable regions contained in the 16S rRNA

of

gene. Studies suggest that 16S rRNA gene sequencing provides genus identification in

ro

most cases but less so with regard to species (Janda and Abbott, 2007). It has been

-p

pointed out that a key limitation of using 16S rRNA gene sequencing is that related

re

bacterial species may be indistinguishable due to the homology between sequences

lP

(Plummer et al., 2015). Sequencing of hypervariable sub-regions of this gene produces single short-reads of 100-500 nucleotides, when the gene contains 1542 base pairs, variations out of sequencing area (Zhang et al., 2011). Although some sub-

na

missing

regions provide a good approximation of 16S diversity, most do not distinguish

ur

polymorphisms between closely related taxa, which could be avoided by the complete

Jo

sequencing of the gene (Johnson et al., 2019). Also, the presence of multiple intragenomic copies of this gene (1 up to 5 or more), which could carry subtle nucleotide substitutions,

may overestimate microbial abundance (Johnson et al., 2019;

Větrovský and Baldrian, 2013). 1.2. Comparison of platforms using short-reading lengths Another source of bias when selecting sequencing platforms and the subsequent data processing is the software of choice and the analytical protocol (pipeline). It has been previously reported that the employment of different platforms can yield differences in

Journal Pre-proof sample characterization depending on their accuracy, sensitivity and error rate (Claesson et al., 2017). Up to date, no standard evaluations studies exist to select the best

platform in terms of accuracy or sensitivity when evaluating complex

communities. Escobar-Zepeda et al., have recently publish a benchmark tool that looks into error rates and sensitivity (Escobar-Zepeda et al., 2018). Loman and co-workers showed that Illumina MiSeq yielded the lowest error rate when compared with both

of

Ion Torrent Personal Genome Machine (PGM) and the Roche 454 platform (Loman et

ro

al., 2012). Nowadays, the Roche 454 has been discontinued whereas PGM and

-p

Illumina MiSeq bench-top versions are the platforms of election for researchers

re

starting microbiota characterization projects, and the prevalence of Illumina over PGM has significantly increased during the last years (Escobar-Zepeda et al., 2015). There

lP

have been some improvements on the PGM platform with the availability of new

na

sequencing kits and chips that decrease the lower indel error rates, but the Illumina MiSeq platform seems to be better avoiding indel errors (Jünemann et al., 2013). The

ur

Illumina MiSeq technology uses an approach of sequencing by synthesis, utilizing

Jo

fluorescently labeled reversible-terminator nucleotides and, depending on the model, it takes several days to obtain the data (Pylro et al., 2014). It is probably the most popular sequencing platform, in view of its competitive cost, lower than in the case of the PGM machine (Escobar-Zepeda et al., 2015). For Illumina MiSeq, the hypervariable regions V3-V4 are the most used for microbiota characterization analysis. The amplicon is generated by paired-ends reads covering around 460 nucleotides and allowing a certain degree of overlap in the readings at its 3'end, the area of greatest loss of quality (Barb et al., 2016; Claesson et al., 2010; Klindworth et al., 2013; Quail et al., 2012).

Journal Pre-proof Considering the sequencing protocols, the PGM platform seems to be the best for characterization of hypervariable regions (V1-V9), due to the sequencing of all the regions at the same time by the most recent versions of their chips and larger singleread sequences (Barb et al., 2016). Due to this coverage, Quail et al. found that the performance of the PGM machine was better, but still at the expense of a higher false positive rate (Quail et al., 2012).

of

1.3. Comparative analysis of pipelines for microbiota data

ro

Analytical pipelines applied to data processing can also introduce bias factors, as not

-p

all available pipelines include the same steps (Allali et al., 2017). For a non-expert

re

bioinformatician, to start a microbial community analysis project requires some initial input into the available algorithms, software and pipelines. There is limited literature

lP

comparing the advantages and disadvantages of available software packages. The

na

choice depends on the level of taxonomic depth and complexity to be achieved in a study. Nilakanta et al., concluded that QIIME and Mothur where the most

ur

comprehensive pipelines and both include features and support documentation

Jo

(Nilakanta et al., 2014). D’Argenio et al. did not find statistically significant differences in the taxonomic assignment and alpha diversity measures when using MG-RAST and QIIME, but stated that QIIME produced more accurate assignments (D’Argenio et al., 2014). QIIME1 protocols were initially established for 454 pyrosequencing raw data and later adapted to Illumina technology (Caporaso et al., 2012). Later, Bokulich 2013 presented new guidelines trying to improve default parameters for quality filtering of Illumina reads (Bokulich et al., 2013). Now QIIME is under the transition to QIIME2, where important changes have been made in terms

Journal Pre-proof of overrepresentation of OTUs, characterization (now “features”) and other quality related factors (Bolyen et al., 2018). VSEARCH is an open source alternative to the USEARCH tool and can be implemented in the QIIME package. It includes chimera detection, paired-end reads merging and dereplication (Rognes et al., 2016) and was designed as an alternative to the widely used USEARCH tool (Edgar, 2010) for which the source code is not free.

of

SPINGO is a novel method to annotate sequences at genus or species level (Allard et

ro

al., 2015). Considering the results reported by Allard et al., and more recently by

-p

Escobar-Zepeda the use of SPINGO based on the RDP database seems a good method

re

to report high accuracy at the species level (Bokulich et al., 2013; Escobar-Zepeda et al., 2018). Escobar-Zepeda considered also the use of different databases concluding

lP

small, but more curated databases, like RDP improved and their results in terms of

na

sensitivity, specificity and accuracy (Escobar-Zepeda et al., 2018). After all those studies, there is not yet a well-established protocol to decide which

ur

platform adapts better to specific research topics and there are not defined protocols

Jo

considering all the sources of constraints detailed above, especially for novices in the field. The main objective of this study was to analyze a known bacterial community (mock) in a pilot study using two common benchtop sequencing platforms that are currently on the market and using four different analysis pipelines: QIIME1, QIIME1MOD, VSEARCH and QIIME2. The use of a standardized mock allows to measure directly the sequencing and analytical methods without considering the bias introduced by sample preparation. Also, to increase taxonomic resolution at species level of the studied mock community, the species classifier SPINGO was used.

Journal Pre-proof

2. Material and methods 2.1. Mock microbial community Microbial Mock Community B (HM-783D, staggered, low concentration, v5.2L, for 16SrRNA gene sequencing) contained known operons counts of 20 species as per which can be obtained from BEI (Manassas,

of

Certificate of Analysis for HM-783D,

ro

Virginia, U.S.A.)(BEI Resources, 2014), including Gram-positive and Gram-negative

-p

bacteria. The sample contained a staggered number of ribosomal RNA operon counts

re

from 103 to 106 .

na

2.2.1. Illumina MiSeq Analysis

lP

2.2. Sequencing

Bacterial 16S rRNA gene analysis was performed by amplifying the hypervariable

ur

regions V3-V4 of the 16s rRNA gene using the primer pair:

Jo

5’-TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG-3' and 5'- 165 GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAG-3’ (Klindworth et al., 2013).

Amplification was done with a GeneAmp PCR System 9700 thermocycler (Applied BioSystems, Foster City, CA, USA). PCR reactions were carried out following the manufacturer’s instructions. PCR products were analyzed in a 1% agarose gel. Before sequencing, amplicons were purified with the MinElute Gel extraction kit (Qiagen, Hilden, Germany). For sequencing, the Illumina MiSeq version 3 (600 cycles) platform was used (Illumina, San Diego, CA) (Claesson et al., 2010).

Journal Pre-proof 2.2.2. Ion Torrent PGM Analysis Similarly, 16S region was amplified with Ion 16S T M Metagenomics Kit (Life Technologies, Spain) by two separated PCR reactions using primer sets: V2–4–8 and V3, V6-7 y 9. The primer sets: V2–4–8 and V3, V6-7 and 9 refers to as primer set 1, which includes primers for amplification of regions: v2, v4 and v8 (V2–4–8), and primer set 2 which includes primers for amplification of regions: v3, v 6 and v7 in a

of

single amplicon, and v9. Equal volumes of the V2, V4, V8, V3, V6-7, and V9

ro

amplifications reactions were combined. Obtained amplicons were processed to make

-p

the DNA library using the KAPA Library Preparation Kit for Ion Torrent Platforms.

re

Library quantification was carried out by Qubit dsDNA Assay Kit. Library quality was checked by agarose gel electrophoresis. PCRs were done following manufacturer’s

lP

indications as previously reported by Barb et al., 2016. Sequences were purified using

na

Agencourt AMPure XP Reagent for mock and Axy Prep PCR Clean-up. Emulsion PCRs were analyzed using the Ion PGM Hi Q Template kit according to the

ur

manufacturer’s instructions. Sequencing of the amplicon library was carried out on an

Jo

Ion 314T M Chip on the Ion Personal Genome Machine (PGM). After sequencing, all reads were filtered to remove those of low quality. Multiplex PCR reactions were subsequently combined and further processed as a single sample containing various amplicons. Demultiplexing as well as barcode trimming and basic quality trimming occur by default at the end of sequencing run and are performed still on the sequencer. 2.3. Bioinformatics The software QIIME (Quantitative Insights Into Microbial Ecology) was used to assess the platform comparison (Caporaso et al., 2010; Rideout et al., 2018). We

Journal Pre-proof compared four different data analyses strategies or pipelines to find the one that best fit our mock composition. The objective is to use the mock as baseline to test running conditions and to select the better procedure with the rest of the samples. Considering that these days there is a transition from QIIME to QIIME2, both the QIIME v1.9 and QIIME 2 v18.2, April 2018, were applied to compare the obtained results (Table 1). We also used VSEARCH, the new pipeline of the BMP (Brazilian Microbiome

of

Project) that is available online (http://www.brmicrobiome.org/) including protocols

ro

and guidelines for analyzing data coming from the two platforms (Illumina MiSeq and

-p

PGM).

re

Sequences were clustered into bins called ‘Operational Taxonomic Units’ (OTUs) based upon similarity. This similarity between a pair of sequences is computed as the

lP

percentage of sites that agree in a pairwise sequence alignment. A common similarity

na

threshold used is 97% (Nguyen et al., 2016).

For all these pipelines, the taxonomic classification was conducted using Greengenes

ur

13_8 97%, which seems accurate to genus level but loses specificity at the species

Jo

level (Yarza et al., 2014).

The two Illumina sequencing runs (paired-end reads) were assembled with fastq format using the script join_paired_ends.py (Aronesty, 2013). Single-ends PGM reads were directly introduced in the pipeline as an individual sequence. Subsequently, the data analysis was done as follows and the main scripts and commands were summarized in Table 1: a) Pipeline 1 (QIIME default parameters; QIIME v1.9.1 (Pylro et al., 2014)). The output file was pre-processed for quality filtering with Q20 as the minimum

Journal Pre-proof Phred quality score (q=19), using the QIIME script (split_libraries_fastq.py) (Bokulich et al., 2013; Caporaso et al., 2012). OTUs were assigned using openreference OTU picking protocol (pick_open_reference_otus.py) which also runs

assign_taxonomy.py

with the UCLUST algorithm as default and

Greengenes13_8 as reference database. During this step sequences were clustered into OTUs at 97% similarity.

of

b) Pipeline 2 (QIIME modified parameters; QIIME1_MOD (Edgar et al., 2011)).

ro

This strategy comprises as upstream following data pretreatment as for pipeline

-p

1 (split_libraries_fastq.py; quality filtering Q20). The downstream analysis

re

follows as indicated in Table 1: OTU picking through pick_otus.py->selection of representative sequences->Chimera detection using the Chimera Slayer

lP

method->taxonomic assignments using the UCLUST method (Edgar, 2010).

na

c) Pipeline 3 (Pipeline 2017 Pylro). This pipeline uses VSEARCH, an updated version of the USEARCH pipeline available in the web of BMP (version

ur

November 2017) (Pylro et al., 2016, 2014; Rognes et al., 2016). Data

Jo

pretreatment includes the same script (split_libraries_fastq.py) but, in the case of Illumina data, filters were turned off (Phred quality score q=0) and default parameters applied for PGM data analysis. A fastq_maxee 1.0 was also applied to data and later, singletons were discarded before the OTU assignment. OTU clustering method was UPARSE (-id 0.97). The taxonomic assignment for Illumina data was done as indicated in the pipeline 1. d) Pipeline 4 (QIIME2 v.2018.2): raw sequence data were imported into QIIME 2 software using the QIIME2 command-line interphase (q2cli) to import data with the QIIME tools import method. The data were trimmed, truncated,

Journal Pre-proof quality filtered, dereplicated, and chimeras removed with the QIIME2 plugin DADA2 with p-max-ee 1.0 (denoised-paired for Illumina and denoise-single for PGM) (Callahan et al., 2016). DADA2 is used for modeling and correcting amplicon errors and includes quality filtering, dereplication, and chimera identification. Representative sequence sets for each DADA2 sequence variant were used for taxonomic classification (QIIME 2 plugin feature-classifier

of

(Bokulich et al., 2018) with the sklearn method (Pedregosa et al., 2011). The

ro

above pipelines (1-4) used the Greengenes v.13_8 database for taxonomic

-p

assignment.

re

For diversity analysis with the tested pipelines alpha-rarefaction plugin (implemented

lP

in QIIME2) was used (Bolyen et al., 2018).

To address these limitations and obtain a higher taxonomic resolution, we used

na

species-classifier SPINGO that uses a custom database, based on RDP (Ribosomal Database Project). This software application can classify assignable species sampled

ur

form any environment. Quality filtering of paired-ends and single-end reads with

Jo

SPINGO was done as in the pipelines 1 and 3 with Q20. A reads.fa file was created with the filtered sequences and then classified by using the species classifier SPINGO version 1.3 64 bits with default parameters and its own database based on RDP (Allard et al., 2015). 2.4. Statistical comparison of pipelines and sequencing platforms To compare the performance of each pipeline with data from the two different sequencing platforms, taxonomic summary files were compared with the known staggered mock sample by computing Spearman’s rank correlation coefficient and

Journal Pre-proof Kullback-Lieber

divergence.

Kullback-Lieber

divergence

(DKL)

measures

the

difference between two probability distributions, where a larger DKL indicates a greater divergence between distributions (Kullback and Leibler, 1951) whereas the Spearman’s coefficient is used to discover the strength of a link between two data sets. Also, Wilcoxon signed-rank test was used to evaluate statistical differences between observed and expected taxa at family, genus and species level, and contingency

of

analysis was used to evaluate differences between true positive and false positive taxa

ro

(Table 2).

-p

The parameters taxon accuracy rate (TAR) and taxon detection rate (TDR), which are

re

useful alternatives, rely on the presence/absence of expected taxa calculated. TDR is the number of true positive features divided by the total number of expected features,

lP

and TAR is the number of true positive features divided by the total number of

na

observed features. Considering taxa abundance unassigned, under-classified and misclassified taxa were used. Unassigned are non- identified features, under-classified

ur

are features expected identified only in upper taxa whilst misclassified taxa are

Jo

observed features that were not expected (Bokulich et al., 2018). Alpha diversity metrics are widely used for the characterization of communities. It has two components, species richness and equitability indices. To study species richness, rarefaction plots of observed OTUs (OTUs actually observed in samples) and chao1 (OTUs expected in samples) were used. For equitability, Pielou’s evenness studies the uniformity of the population size of each of the species (Jost, 2010). Shannon diversity estimates both species richness and evenness. This index is commonly used to characterize species diversity and assumes that all species are represented in a sample if they are randomly sampled (Kim et al., 2017).

Journal Pre-proof

3.

RESULTS

3.1. Benchmarking different analytical protocols using a mock community:

of

advantages and disadvantages

ro

Mock communities are useful for method benchmarking as the actual composition of a

The sequenced mock sample provides a

re

community profiling with accuracy.

-p

mock community is known in advance and allows quantitative assessments of

theoretical community to be compared to the observed after sequencing. The genomic

lP

DNA from the microbial Mock Community HM-783D contains twenty individual

na

species with operons ranging from 1,000 to 1,000,000 (Staggered), being five species at the lowest operon level: Actinomyces odontolitycus, Bacteroides vulgatus,

ur

Deinococcus radiodurans, Enterococcus faecalis, and Streptococcus pneumoniae.

Jo

Considering the expected and detected bacteria, we compared the results at family, genus and species level after sequencing the mock sample in the PGM and Illumina MiSeq v3 (600 cycles) machines (Table 2). At family level, all pipelines for both platforms detected the 17 expected families, except for pipelines QIIME1-PGM, QIIME2-ILL

and

QIIME2-PGM,

missing

Bacteriodaceae,

Listeriaceae

and

Enterococcaceae respectively (Table 2 and Figs. 1 and 1s). Overestimated families by most of the pipelines were Bacteroidaceae, Helicobacteriaceae, Enterobacteriaceae and Pseudomonaceae. The under-classified OTUs (expected features identified only in upper taxa) were reduced drastically at family level compared to genus and species.

Journal Pre-proof QIIME 2-PGM, presented the highest percentage of unassigned and under-classified OTUS with no misclassifications (observed features that were not expected) at family level. Also, analysis of data reported by Illumina produce higher percentages of misclassifications than data from PGM (Fig. 1e). At genus level, Escherichia was only detected by QIIME1 (both platforms) and QIIME1-ILL was the only strategy able to detect the 17 expected genera. Other not detected genera were Bacteroides by

of

QIIME1-PGM, Listeria by QIIME2-ILL and Enterococcus by QIIME2-PGM (Table

ro

2, Figure 2). There were no unassigned OTUs for QIIME 2-Illumina and considering

-p

the misclassified and under-classified OTUs QIIME 2 data presented the highest

re

percentages as it was seen at family level (Fig. 2e).

Finally, at species level, only 25-40% species (4-8 species) were identified by the

lP

different bioinformatic strategies (Table 2).

na

Considering the false positive (FP) and false negative (FN) taxa, QIIME 2 produce the highest FN and the lowest FP taxa at all levels as opposed to QIIME1-ILL with the

ur

highest FP and lowest FN of all the tested strategies (Table 2). Contingency analysis

Jo

did not find statistical differences between false positive and false negative taxa at each of the studied pipelines. To compare the taxonomic identification performed at family, genus and species levels, OTU counts were summarized in an incidence map (Fig. 4), and the 20 species presented in the mock sample were classified according to the operon count. Any OTU not classified up to the family level was labelled as “non-detected.” Our results showed that Helicobacter pylori and Propionibacterium acnes were the only two species detected by all strategies/platforms, with an operon count of 105 for both species.

Also,

Bacillus cereus,

Staphylococcus aureus, Staphylococcus

Journal Pre-proof epidermidis,

Streptococcus

agalactiae,

and

Rhodobacter sphaeroides,

which

appeared at high operons counts (105 -106 ), were detected by most of the strategies/ platforms (Fig. 2) at species level. Limitations in detection were found for Escherichia coli that was only detected at the species level by QIIME1 (rest of pipelines

only

difficulties

achieve

detecting

family

level),

Enterococcus

QIIME1-PGM

faecalis

and

(from PGM

QIIME2 data)

and

showed Listeria

of

monocytogenes (from Illumina data) at family, genus and species levels. The same

ro

was observed in the case of Bacteroides vulgatus following the QIIME1-PGM

-p

(Figure 3).

re

To quantify the distribution of observed and expected taxa, the DKL was calculated at family, genus and species levels. In addition, Spearman's rank correlation coefficient

lP

was calculated between observed and expected taxa (Table 3). At family level, the

na

lowest DKL were VSEARCH and QIIME1, in the case of Illumina and QIIME1 for PGM. At genus level, the lowest DKL was obtained by QIIME1 in both Illumina and

ur

PGM. At species level, divergence range from 8 to 16.

Jo

Results of correlation were statistically significant at family and genus level but not for species being QIIME1-PGM and VSEARCH-ILL, the both family and genus. Moreover, Wilcoxon signed-rank test was used to evaluate the differences between observed and expected taxa without statistical differences at family and genus level. Only at species level we found statistical differences, except for QIIME1-PGM (data not shown). Finally, a heatmap of relative abundance at family, genus and species level was generated (Fig. 4). Only strategies using Illumina were clustered at family level with

Journal Pre-proof the expected mock, except for QIIME2 (Fig. 4a). At genus and species level (Fig 4b and 4c), none of the strategies were clustered with the mock. 3.2. Accuracy and Sensitivity: Taxon Detection Rate and Taxon Accuracy Rate For mock sample characterizations, Taxon Accuracy Rate (TAR) has been used to evaluate accuracy in taxonomy classifications, and Taxon Detection Rate (TDR) as an approximation

for

sensitivity

measures

and

provide comparable

results when

of

analyzing mock communities (Bokulich et al., 2018). All the pipelines showed high

ro

TDR score, being QIIME2 the one with the lowest TDR in the two platforms (Fig. 5a).

-p

On the contrary, QIIME2 reported the highest accuracy (TAR) at all levels (Fig. 5b).

re

3.3. Species identification by the species classifier SPINGO Identification of OTUs to species level is quite relevant when studying microbiota, as

lP

some species from the same genus may be pathogenic and other commensals. To

na

increase taxonomic resolution at species level of the mock community studied we have used Species classifier SPINGO. The analyses of the mock community by SPINGO

Jo

Illumina (Table 4).

ur

detected 90-100% of the expected species, 18 from the PGM platform and 20 from

Considering the detectability threshold, SPINGO-ILL could detect the five species at low operon levels, and SPINGO-PGM only three, where neither Streptococcus pneumoniae

nor Enterococcus faecalis

were detected.

Three species were

overestimated, Helicobacter pylori and Pseudomonas aeruginosa for ILL and Bacteroides vulgatus by PGM (Fig. 3s). A heatmap of relative abundance at species level for SPINGO is shown at Fig. 5, where a unique cluster between the two pipelines can be seen. Wilcoxon signed-rank test reported statistical differences between SPINGO-PGM but not for SPINGO-ILL.

Journal Pre-proof 3.4. Comparing technologies using Alpha diversity approaches As the sequencing depth can affect alpha diversity measures, analyses were done using rarefaction curves, and the number of reads was standardized for all samples and platforms. Rarefaction analyses were performed at a maximum sampling depth of 29,421 reads. Rarefaction curves of observed species began to plateau, indicating enough sequencing depth. To compare how diversity and evenness of the observed

of

data match the expected data, rarefaction of observed species, Chao1 index, Pielou’s

ro

evenness and the Shannon diversity metrics were performed (Fig. 7).

-p

The highest richness (chao1 and observed OTUs) was observed by QIIME1 MOD-

re

PGM, followed by QIIME1-ILL, and lowest with QIIME 2-ILL (Fig. 7 a and b).

lP

Conversely, results of equitability by Pielou`s and Shannon index, indicated higher alpha diversity for QIIME2-PGM, and lowest for QIIME1 MOD-ILL. Strategies using

na

data from PGM presented a higher evenness and Shannon index than those for Illumina (Fig. 7 c,d).

ur

Increased diversity was observed by PGM, and regarding the pipelines, by QIIME2-

Jo

PMG, when both species richness and evenness were considered as reported by Shannon diversity index. 4. DISCUSSION One of the most important decisions a novice researcher must take before starting a project based on microbiota characterization is to determine the sequencing platform and how the upcoming data will be analyzed, using the available wrapped protocols. The goal of this study was to identify key parameters to define a bioinformatics strategy for analyzing data coming from short-reads sequences of the hypervariable

Journal Pre-proof regions of the 16S rRNA gene, obtained from two different sequencing platforms, PGM and MySeq. Different approaches were used to characterize a mock community (low diversity sample). Mock communities contain well-characterized species and can be useful for novice users to test their results and become familiar with the protocols. 16s RNA gene sequencing, albeit with some limitations such as multiple gene copies (Carlos et al., 2012; Větrovský and Baldrian, 2013), is the most used tool to identify

of

bacterial communities, but frequently limited to genus level of identification (Johnson

ro

et al., 2019; Noecker et al., 2017). If the objective of the study is characterization at

-p

the species level, then short-read sequencing might not be the best solution, as

re

evidenced by our statistical analysis between expected and observed species (Table 2). These results agree with those recently reported by Johnson et al. (2019). It has also

lP

recently been reported that incorporating environment-specific taxonomic abundance

na

information increases species level classification accuracy across common sample types (Kaehler et al., 2018).

ur

The platform of choice sometimes depends on budget or availability, there being a

Jo

considerable difference in cost between the second and third-generation sequencing platforms currently on the market. In our case, we compared Illumina MiSeq and PGM platforms (second-generation with a lower cost). Regarding the software, there is limited literature comparing the advantages and disadvantages of available software packages (D’Argenio et al., 2014; Nilakanta et al., 2014). For a non-experienced bioinformatician, to start a microbiota project requires some insights in the available methods. When starting our project, we selected the QIIME toolkit as is open-source and has an enormous network of users. Also, the

Journal Pre-proof developers use to provide fast feedback to the users when using their code (https://forum.qiime2.org/). It matters to say that during our study this software was undergoing a transition to QIIME2 (QIIME1 has been out of service since January 2018) (Bolyen et al., 2018). Considering this situation, we used both, the old and the newer version. The reported results were not as comparable as they should be. The transition to QIIME 2 also changes the well-known system of OTUs classification.

of

The new taxonomic unit assignment method is known as “feature classifier”. QIIME

ro

protocols were initially established for 454 pyrosequencing raw data and adapted to

-p

Illumina technology due to an increase in the use of the Illumina sequencing machine

re

(Caporaso et al., 2012). Later, Bokulich presented new guidelines trying to improve default parameters for quality filtering of Illumina reads (Bokulich et al., 2013), being

lP

some of them recently discussed by Edgar, (2017). Due to that, we include VSEARCH

na

(the free version of USEARCH) in our analysis. Our results reported that when using QIIME 2, fewer taxa were identified (low TDR

ur

and richness), but with higher accuracy (high TAR). The missing information when

Jo

using QIIME2 may be related to the use of DADA2 denoising methods, not comparable to quality filtering methods. DADA2 denoising is specialized in reducing the presence of false positives and could make difficult to detect low operon count species. DADA2 seems to be a good choice for assigning identity with high accuracy using high-quality data. Regarding the PGM platform, the use of QIIME 1 seemed to be the optimal pipeline in terms of correlation and divergence, except for analysis requiring the detection of Bacteroides or other Bacteroidaceae members, as gut microbiota studies. In these

Journal Pre-proof specific situations, QIIME1MOD (with chimera removal) seems to be a good alternative (Allali et al., 2017).Concerning Illumina MiSeq, VSEARCH can be a good option as it showed low divergence and high correlation at the family and genus levels. Similar results were reported by QIIME1. Most microbiota characterization studies have the genus identification level as a target. We must point out that VSEARCH did not detect the genus Escherichia. Something

of

similar happened in the case of QIIME2, were neither Escherichia, Enterococcus nor

observed,

particularly

in

the

classification

of members

from the

-p

previously

ro

Listeria were detected. Differences at the family and genus levels have been

re

Enterobacteriaceae family (Barb et al., 2016). The characterization at species level of identification is key when analyzing clinical samples where distinguishing between

lP

two species in one genus can be important in terms of clinical decisions (Claesson et

na

al., 2017). In Clinical Microbiology, the use of PGM, based on ion semiconductor technology approach that reports outputs with run times of two hours, could provide a

ur

good advantage (Claesson et al., 2017). However, as showed in this study, PGM

Jo

presents limitation in characterizing the following species Bacteroides vulgatus or Enterococcus faecalis at any level (Fig. 3). Overall, albeit considering these limitations, clinical identification based on 16S rRNA gene shows a genus-level concordance rate of 96% and a species-level concordance rate of 87.5% (Srinivasan et al., 2015). It has recently been reported that the use of third-generation sequencing platforms such as PacBio CGC targeting the full-length gene sequence of the 16S rRNA gene could improve the identification rate at species level (Johnson et al., 2019).

Journal Pre-proof In labs with the abovementioned technology reporting data from short-read sequencing platforms, identification at species level can be improved when using appropriate software such as SPINGO. SPINGO has been reported to be a standalone, rapid, and accurate classification for samples from different environments. Recently, EscobarCepeda et al., reported that SPINGO was the method with the highest accuracy at species level when using RDP database (Escobar-Zepeda et al., 2018). Although with

of

some restrictions in the relative abundance calculation due to the problems discussed,

ro

second-generation NGS can still be a good strategy when comparing two different

-p

populations (patients vs. controls; treated vs. untreated, etc.). In our study, SPINGO

re

detected 90-100% of the expected species, also using RDP in data from Illumina MiSeq.

lP

Using the results of our studies, and in order to provide a benchmark analysis guide for

na

non-professional bioinformatician research, Table 5 contains the “pros”, “cons”, and “indications” of the two sequencing platforms and analytical pipelines.

ur

To conclude, no statistical differences were found between observed and expected taxa

Jo

up to genus level, albeit with small variations, indicating that all of them can be used for metagenomic analysis at this level. However, we may consider individual limitations in the identification of specific genus and many restrictions at species level, SPINGO being the best approach at this level. Presumably, the sequencing of fulllength 16S intragenomic copy variants has the potential to provide taxonomic resolution of bacterial communities at species and strain level. Authors’ contribution

Journal Pre-proof MB, CB and RGS design the experiments. CB perform the experiments. MB, CB, ML and RGS analyze the data. MB and RGS wrote the paper. CB and ML reviewed the paper and make substantial contributions to the final version. Data availability The raw sequencing data can be requested by researchers and will be available after

of

the deposition in the NCBI public database shortly.

ro

Acknowledgments:

-p

This research was supported by the Ramón y Cajal grant 2012_11910 and the

re

AGL2016-77288-R project from the Ministry of Economy and Competitiveness, Spain, the “Cátedra ASISA” project 2015/UEM46, and the 2013/UEM19 and projects from the Universidad

lP

2015/UEM06

Europea de Madrid.

We also

ur

STABvida, Portugal.

na

acknowledge the services of Parque Científico de Madrid, Spain and the company,

Jo

Conflict of Interest Statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest. Nº Secuencia SUB6291805 https://submit.ncbi.nlm.nih.gov/subs/sra/SUB6291805/files REFERENCES Allali, I., Arnold, J.W., Roach, J., Cadenas, M.B., Butz, N., Hassan, H.M., Koci, M., Ballou, A., Mendoza, M., Ali, R., Azcarate-Peril, M.A., 2017. A comparison of sequencing platforms and bioinformatics pipelines for compositional analysis of

Journal Pre-proof the gut microbiome. BMC Microbiol. 17, 1–16. https://doi.org/10.1186/s12866017-1101-8 Allard, G., Ryan, F.J., Jeffery, I.B., Claesson, M.J., 2015. SPINGO: A rapid speciesclassifier for microbial amplicon sequences. BMC Bioinformatics 16, 1–8. https://doi.org/10.1186/s12859-015-0747-1

of

Aronesty, E., 2013. Comparison of Sequencing Utility Programs. Open Bioinforma. J.

ro

7, 1–8. https://doi.org/10.2174/1875036201307010001

-p

Barb, J.J., Oler, A.J., Kim, H.S., Chalmers, N., Wallen, G.R., Cashion, A., Munson, P.J., Ames, N.J., 2016. Development of an Analysis Pipeline Characterizing

re

Multiple Hypervariable Regions of 16S rRNA Using Mock Samples. PLoS One

lP

11, 1–18. https://doi.org/10.1371/journal.pone.0148047

na

BEI Resources, 2014. The following reagent was obtained through BEI Resources, NIAID, NIH as part of the Human Microbiome Project: Genomic DNA from

ur

Microbial Mock Community B (Staggered, Low Concentration), v5.2L, for 16S

Jo

rRNA Gene Sequencing, HM-783D. BEI Resources,10801 Uni. Bokulich, N.A., Kaehler, B.D., Rideout, J.R., Dillon, M., Bolyen, E., Knight, R., Huttley, G.A., Caporaso, J.G., 2018. Optimizing taxonomic classification of marker-gene amplicon sequences with QIIME 2 ’ s q2-feature-classifier plugin. Microbiome 6, 90. https://doi.org/10.1186/s40168-018-0470-z Bokulich, N.A., Subramanian, S., Faith, J.J., Gevers, D., Gordon, J.I., Knight, R., Mills, D.A., Caporaso, J.G., 2013. Quality- filtering vastly improves diversity estimates from Illumina amplicon sequencing. Nat. Methods 10, 57–59.

Journal Pre-proof https://doi.org/10.1038/nmeth.2276 Bolyen, E., Rideout, J.R., Dillon, M.R., Bokulich, N.A., Abnet, C., Al-Ghalith, G.A., Alexander, H., Alm, E.J., Arumugam, M., Asnicar, F., Bai, Y., Bisanz, J.E., Bittinger, K., Brejnrod, A., Brislawn, C.J., Brown, C.T., Callahan, B.J., Caraballo-Rodríguez, A.M., Chase, J., Cope, E., Silva, R. Da, Dorrestein, P.C., Douglas, G.M., Durall, D.M., Duvallet, C., Edwardson, C.F., Ernst, M., Estaki,

of

M., Fouquier, J., Gauglitz, J.M., Gibson, D.L., Gonzalez, A., Gorlick, K., Guo, J.,

ro

Hillmann, B., Holmes, S., Holste, H., Huttenhower, C., Huttley, G., Janssen, S.,

-p

Jarmusch, A.K., Jiang, L., Kaehler, B., Kang, K. Bin, Keefe, C.R., Keim, P.,

re

Kelley, S.T., Knights, D., Koester, I., Kosciolek, T., Kreps, J., Langille, M.G.,

lP

Lee, J., Ley, R., Liu, Y.-X., Loftfield, E., Lozupone, C., Maher, M., Marotz, C., Martin, B.D., McDonald, D., McIver, L.J., Melnik, A. V, Metcalf, J.L., Morgan,

na

S.C., Morton, J., Naimey, A.T., Navas-Molina, J.A., Nothias, L.F., Orchanian, S.B., Pearson, T., Peoples, S.L., Petras, D., Preuss, M.L., Pruesse, E., Rasmussen,

ur

L.B., Rivers, A., Michael S Robeson, I., Rosenthal, P., Segata, N., Shaffer, M.,

Jo

Shiffer, A., Sinha, R., Song, S.J., Spear, J.R., Swafford, A.D., Thompson, L.R., Torres, P.J., Trinh, P., Tripathi, A., Turnbaugh, P.J., Ul-Hasan, S., Hooft, J.J. van der, Vargas, F., Vázquez-Baeza, Y., Vogtmann, E., Hippel, M. von, Walters, W., Wan, Y., Wang, M., Warren, J., Weber, K.C., Williamson, C.H., Willis, A.D., Xu, Z.Z., Zaneveld, J.R., Zhang, Y., Zhu, Q., Knight, R., Caporaso, J.G., 2018. QIIME 2: Reproducible, interactive, scalable, and extensible microbiome data science. https://doi.org/10.7287/peerj.preprints.27295v2 Callahan, B.J., McMurdie, P.J., Rosen, M.J., Han, A.W., Johnson, A.J.A., Holmes,

Journal Pre-proof S.P., 2016. DADA2: High-resolution sample inference from Illumina amplicon data. Nat. Methods 13, 581–583. https://doi.org/10.1038/nmeth.3869 Caporaso, J.G., Kuczynski, J., Stombaugh, J., Bittinger, K., Bushman, F.D., Costello, E.K., Fierer, N., Peña, A.G., Goodrich, J.K., Gordon, J.I., Huttley, G.A., Kelley, S.T., Knights, D., Koenig, J.E., Ley, R.E., Lozupone, C.A., McDonald, D., Muegge, B.D., Pirrung, M., Reeder, J., Sevinsky, J.R., Turnbaugh, P.J., Walters,

of

W.A., Widmann, J., Yatsunenko, T., Zaneveld, J., Knight, R., 2010. QIIME

-p

335–6. https://doi.org/10.1038/nmeth.f.303

ro

allows analysis of high-throughput community sequencing data. Nat. Methods 7,

re

Caporaso, J.G., Lauber, C.L., Walters, W.A., Berg-Lyons, D., Huntley, J., Fierer, N.,

lP

Owens, S.M., Betley, J., Fraser, L., Bauer, M., Gormley, N., Gilbert, J.A., Smith, G., Knight, R., 2012. Ultra-high-throughput microbial community analysis on the

na

Illumina HiSeq and MiSeq platforms. ISME J. 6, 1621–1624.

ur

https://doi.org/10.1038/ismej.2012.8

Jo

Carlos, N., Tang, Y.-W., Pei, Z., 2012. Pearls and pitfalls of genomics-based microbiome analysis. Emerg. Microbes Infect. 1, 1–3. https://doi.org/10.1038/emi.2012.41 Claesson, M.J., Clooney, A.G., O’Toole, P.W., 2017. A clinician’s guide to microbiome analysis. Nat. Rev. Gastroenterol. Hepatol. 14, 585–595. https://doi.org/10.1038/nrgastro.2017.97 Claesson, M.J., Wang, Q., O’Sullivan, O., Greene-Diniz, R., Cole, J.R., Ross, R.P., O’Toole, P.W., 2010. Comparison of two next-generation sequencing

Journal Pre-proof technologies for resolving highly complex microbiota composition using tandem variable 16S rRNA gene regions. Nucleic Acids Res. 38, e200–e200. https://doi.org/10.1093/nar/gkq873 D’Argenio, V., Casaburi, G., Precone, V., Salvatore, F., 2014. Comparative metagenomic analysis of human gut microbiome composition using two different bioinformatic pipelines. Biomed Res. Int. 2014, 325340.

ro

of

https://doi.org/10.1155/2014/325340

Edgar, R.C., 2017. Accuracy of microbial community diversity estimated by closed-

-p

and open-reference OTUs. PeerJ 5, e3889. https://doi.org/10.7717/peerj.3889

re

Edgar, R.C., 2010. Search and clustering orders of magnitude faster than BLAST.

lP

Bioinformatics 26, 2460–2461. https://doi.org/10.1093/bioinformatics/btq461

na

Edgar, R.C., Haas, B.J., Clemente, J.C., Quince, C., Knight, R., 2011. UCHIME improves sensitivity and speed of chimera detection. Bioinformatics 27, 2194–

ur

2200. https://doi.org/10.1093/bioinformatics/btr381

Jo

Escobar-Zepeda, A., Godoy-Lozano, E.E., Raggi, L., Segovia, L., Merino, E., Gutiérrez-Rios, R.M., Juarez, K., Licea-Navarro, A.F., Pardo-Lopez, L., SanchezFlores, A., 2018. Analysis of sequencing strategies and tools for taxonomic annotation: Defining standards for progressive metagenomics. Sci. Rep. 8, 12034. https://doi.org/10.1038/s41598-018-30515-5 Escobar-Zepeda, A., Vera-Ponce de León, A., Sanchez-Flores, A., 2015. The Road to Metagenomics: From Microbiology to DNA Sequencing Technologies and Bioinformatics. Front. Genet. 6, 348. https://doi.org/10.3389/fgene.2015.00348

Journal Pre-proof Janda, J.M., Abbott, S.L., 2007. 16S rRNA gene sequencing for bacterial identification in the diagnostic laboratory: Pluses, perils, and pitfalls. J. Clin. Microbiol. https://doi.org/10.1128/JCM.01228-07 Johnson, J.S., Spakowicz, D.J., Hong, B.Y., Petersen, L.M., Demkowicz, P., Chen, L., Leopold, S.R., Hanson, B.M., Agresta, H.O., Gerstein, M., Sodergren, E., Weinstock, G.M., 2019. Evaluation of 16S rRNA gene sequencing for species

ro

https://doi.org/10.1038/s41467-019-13036-1

of

and strain-level microbiome analysis. Nat. Commun. 10.

re

https://doi.org/10.3390/d2020207

-p

Jost, L., 2010. The relation between evenness and diversity. Diversity 2, 207–232.

lP

Jünemann, S., Sedlazeck, F.J., Prior, K., Albersmeier, A., John, U., Kalinowski, J.,

na

Mellmann, A., Goesmann, A., Von Haeseler, A., Stoye, J., Harmsen, D., 2013. Updating benchtop sequencing performance comparison. Nat. Biotechnol.

ur

https://doi.org/10.1038/nbt.2522

Jo

Kaehler, B., Bokulich, N., McDonald, D., Knight, R., Caporaso, G., Huttley, G., 2018. Species abundance information improves sequence taxonomy classification accuracy 406611. https://doi.org/10.1101/406611 Kim, B.R., Shin, J., Guevarra, R.B., Lee, J.H., Kim, D.W., Seol, K.H., Lee, J.H., Kim, H.B., Isaacson, R.E., 2017. Deciphering diversity indices for a better understanding of microbial communities. J. Microbiol. Biotechnol. 27, 2089– 2093. https://doi.org/10.4014/jmb.1709.09027 Klindworth, A., Pruesse, E., Schweer, T., Peplies, J., Quast, C., Horn, M., Glöckner,

Journal Pre-proof F.O., 2013. Evaluation of general 16S ribosomal RNA gene PCR primers for classical and next-generation sequencing-based diversity studies. Nucleic Acids Res. 41, e1–e1. https://doi.org/10.1093/nar/gks808 Kullback, S., Leibler, R.A., 1951. On Information and Sufficiency. Ann. Stat. 22, 79– 86. https://doi.org/10.1214/aoms/1177729694

of

Loman, N.J., Misra, R. V., Dallman, T.J., Constantinidou, C., Gharbia, S.E., Wain, J.,

ro

Pallen, M.J., 2012. Performance comparison of benchtop high-throughput sequencing platforms. Nat. Biotechnol. 30, 434–439.

-p

https://doi.org/10.1038/nbt.2198

re

Nguyen, N.-P., Warnow, T., Pop, M., White, B., 2016. A perspective on 16S rRNA

lP

operational taxonomic unit clustering using sequence similarity. npj Biofilms

na

Microbiomes 2, 16004. https://doi.org/10.1038/npjbiofilms.2016.4 Nilakanta, H., Drews, K.L., Firrell, S., Foulkes, M.A., Jablonski, K.A., 2014. A review

ur

of software for analyzing molecular Sequences. BMC Res. Notes.

Jo

https://doi.org/10.1186/1756-0500-7-830 Noecker, C., McNally, C.P., Eng, A., Borenstein, E., 2017. High-resolution characterization of the human microbiome. Transl. Res. 179, 7–23. https://doi.org/10.1016/j.trsl.2016.07.012 Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, É., 2011. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 12, 2825–2830.

Journal Pre-proof Plummer, E., Twin, J., Bulach, D.M., Garland, S.M., Tabrizi, S.N., 2015. A Comparison of Three Bioinformatics Pipelines for the Analysis of Preterm Gut Microbiota using 16S rRNA Gene Sequencing Data. J Proteomics Bioinform Cit. 8, 283–291. https://doi.org/10.4172/jpb.1000381 Pylro, V.S., Morais, D.K., de Oliveira, F.S., dos Santos, F.G., Lemos, L.N., Oliveira, G., Roesch, L.F.W., 2016. BMPOS: a Flexible and User-Friendly Tool Sets for

of

Microbiome Studies. Microb. Ecol. 72, 443–447. https://doi.org/10.1007/s00248-

ro

016-0785-x

-p

Pylro, V.S., Roesch, L.F.W., Morais, D.K., Clark, I.M., Hirsch, P.R., Tótola, M.R.,

re

2014. Data analysis for 16S microbial profiling from different benchtop

lP

sequencing platforms. J. Microbiol. Methods 107, 30–37.

na

https://doi.org/10.1016/j.mimet.2014.08.018 Quail, M.A., Smith, M., Coupland, P., Otto, T.D., Harris, S.R., Connor, T.R., Bertoni,

ur

A., Swerdlow, H.P., Gu, Y., 2012. A tale of three next generation

Jo

sequencingplatforms: comparison of Ion Torrent, PacificBiosciences and Illumina MiSeq sequencers. BMC Genomics 13, 1–13. https://doi.org/10.1186/1471-216413-341

Rideout, J.R., Dillon, M.R., Bokulich, N.A., Abnet, C., Ghalith, G.A. Al, Alexander, H., Alm, E.J., Arumugam, M., Bai, Y., Bisanz, J.E., Bittinger, K., Brejnrod, A., Colin, J., Brown, C.T., Callahan, B.J., Mauricio, A., Rodríguez, C., Chase, J., Cope, E., Silva, R. Da, Dorrestein, P.C., Douglas, G.M., Duvallet, C., Edwardson, C.F., Ernst, M., Fouquier, J., Gauglitz, J.M., Gibson, D.L., Gonzalez,

Journal Pre-proof A., Huttley, G.A., Janssen, S., Jarmusch, A.K., Kaehler, B.D., Kang, K. Bin, Keefe, C.R., Keim, P., Kelley, S.T., Ley, R., Loftfield, E., Marotz, C., Martin, B., Mcdonald, D., Mciver, L.J., Alexey, V., Metcalf, J.L., Morgan, S.C., Morton, J.T., Naimey, A.T., 2018. QIIME 2 : Reproducible , interactive , scalable , and extensible microbiome data science. https://doi.org/10.7287/peerj.preprints.27295 Rognes, T., Flouri, T., Nichols, B., Quince, C., Mahé, F., 2016. VSEARCH: a

of

versatile open source tool for metagenomics. PeerJ 4, e2584.

ro

https://doi.org/10.7717/peerj.2584

-p

Srinivasan, R., Karaoz, U., Volegova, M., MacKichan, J., Kato-Maeda, M., Miller, S.,

re

Nadarajan, R., Brodie, E.L., Lynch, S. V., 2015. Use of 16S rRNA Gene for

lP

Identification of a Broad Range of Clinically Relevant Bacterial Pathogens. PLoS One 10, e0117617. https://doi.org/10.1371/journal.pone.0117617

na

Větrovský, T., Baldrian, P., 2013. The Variability of the 16S rRNA Gene in Bacterial

ur

Genomes and Its Consequences for Bacterial Community Analyses. PLoS One 8,

Jo

e57923. https://doi.org/10.1371/journal.pone.0057923 Yarza, P., Yilmaz, P., Pruesse, E., Glöckner, F.O., Ludwig, W., Schleifer, K.-H., Whitman, W.B., Euzéby, J., Amann, R., Rosselló-Móra, R., 2014. Uniting the classification of cultured and uncultured bacteria and archaea using 16S rRNA gene sequences. Nat. Rev. Microbiol. 12, 635–645. https://doi.org/10.1038/nrmicro3330 Zhang, J., Chiodini, R., Badr, A., Zhang, G., 2011. The impact of next-generation sequencing on genomics. J. Genet. Genomics 38, 95–109.

Journal Pre-proof

Jo

ur

na

lP

re

-p

ro

of

https://doi.org/10.1016/j.jgg.2011.02.003

Journal Pre-proof TABLES Table 1. Data processing pipelines QIIME v1.9.1 default parameters

VSEARCH v2.3.4

QIIME2 v2018.2

Assembled of pair end reads (MiSeq): multiple_join_paired_ends PGM: single ends read split_libraries_fastq Q≥20 - split_libraries_fa DADA2 denoising (pstq (default for max-ee 1.0): PGM and - qiime dada2 turning off the denoisefilters for paired Illumina (r=0; (Illumina) q=0; n=100)) - Quality filtering: - qiime dada2 denoiseVSEARCH single (fastq_maxee (PGM) 1.0) Discard

ro

of

DAT A PRET REAT MENT

QIIME v1.9.1 modified parameters

T AXONOMIC ANALYSIS (Refence database)

DIVERSIT Y ANALYSIS

OT U picking: de novo (included in DADA2)

No

Included in DADA2

Assign taxonomy using UCLUST method on QIIME (assign_taxonomy. py)

qiime featureclassifier classify-sklearn

re

lP

No

OT U picking: de novo OUT clustering: UPARSE method (VSEARCH) (-id 0.97)

na

CHIMERA REMOVAL

OT U picking: de novo (pick_otus.py) OT U clustering: UCLUST (default) (pick_rep_set.py) (align_seqs.py) (default: PyNast) CHIMERA SLAYER MET HOD (identify_chimeric_seq s.py) and (filter_alignment.py)

ur

(OT U picking)

OT U picking: open reference (pick_open_reference _otus)

Jo

CLUST ERING MET HOD

-p

singletons (VSEARCH) (minsize=2)

(assign_taxonomy.py) with UCLUST method and Greengenes13_8 97% as reference database

Greengenes 97% with UCLUST method

Qiime diversity alpha-rarefaction (QIIME 2_2018.8)

Greengenes13_ 8 97%

Journal Pre-proof Table 2. Expected and detected taxa at staggered mock community at family, genus and species level. A true positive is a taxon that was both observed and expected; A false positive is a taxon that was observed but not expected; A false negative is a taxon that was expected but not observed. FAMILY expected

detected

TP

FP

PGM

ILL

PGM

37

24

17

16

34

36

17

17

VSEARCH

26

26

17

QIIME 2

20

19

QIIME 1 MOD

PGM

17

PGM

8

0

1

17

19

0

0

9

9

0

0

5

4

1

1

-p

17 16

re

16

ILL

20

ro

QIIME 1

ILL

of

ILL

MO CK

FN

GENUS

detected

lP

expected

QIIME 1 QIIME 1 MOD

Jo

QIIME 2

expected

FN

PGM

ILL

PGM

ILL

PGM

ILL

PGM

60

39

17

16

43

23

0

1

52

52

16

16

36

36

1

1

39

38

16

16

23

22

1

1

23

26

15

15

8

11

2

2

ur

VSEARCH

17

FP

ILL

na

MO CK

TP

SPECIES detected

TP

FP

FN

ILL

PGM

ILL

PGM

ILL

PGM

ILL

PGM

77

52

8

8

69

44

12

12

63

66

6

7

57

59

14

13

VSEARCH

51

52

6

7

45

45

14

13

QIIME 2

27

34

4

5

23

29

16

15

PIPELINE

MO CK

QIIME 1 QIIME 1 MOD

20

ILL: Illumina M iSeq; PGM : Ion Personal Genome M achine. FP: false positive; FN: false negative, TP: true positive

Journal Pre-proof Table 3. Kullback-Leiber (KL) divergence and Spearman correlation coefficients for known mock composition and obtained mock composition after sequencing and analysis.

**

P < 0.0001.

Family

Genus

Species

Family

Genus

Species

ILL

0.062

2.247

8.219

0.722**

0.464**

-0.031

PGM

0.115

1.483

6.273

0.855**

0.606**

QIIME1 MOD

ILL

0.132

4.418

11.290

of

0.110

0.750**

0.519**

-0.470

PGM

0.183

4.589

8.952

0.808**

0.525**

0.009

VSEARCH

ILL

0.025

3.846

ro

QIIME1

10.219

0.852**

0.581**

0.004

PGM

0.225

4.131

9.184

0.768**

0.465**

0.055

ILL

0.473

4.020

15.997

0.798**

0.520**

0.024

PGM

0.805

4.423

12.527

0.550**

0.049

0.033

QIIME 2

-p

Platform

lP

Pipeline

Spearman correlation coefficient

re

Kullback-Leiber Divergence (DKL)

Jo

ur

na

ILL= Illumina M iSeq Platform; PGM : Ion Personal Genome M achine

Journal Pre-proof Table 4. Expected and detected taxa at staggered mock community at species level for SPINGO classifier. A true positive is a taxon that was both observed and expected; A false positive is a taxon that was observed but not expected; A false negative is a taxon that was expected but not observed. S pecies expected ILL

20

S pecies detected 95

True positive species 20

False positive species 75

False negative species 0

% Unassigned 56.1

ro

of

PGM 33 18 15 2 71.5 ILL: Illumina M iSeq; PGM : Ion Personal Genome M achine. FP: false positive; FN: false negative, TP: true positive

Jo

ur

na

lP

re

-p

.

Journal Pre-proof Table 5. Results of pipelines comparisons showing the “pros,” “cons,” and general indications for non-professional users.

Q IIME1

Q IIME1 MOD

Q IIME 2

Database Pros

SPING O species clasif f ier

Greengenes Best TDR, detection of E. coli,

Low divergence (D KL ) and high correlation for P GM

Low divergence and high correlation for ILL

High correlation for ILL at family level

High number of true positive species (detection of 90-100% species) Lowest number of false negative

Non-unassigned OTUs for ILL

of

Detection of species at low operon levels

Lowest TAR

4-5 species detected

7-8 species detected

P GM missed 7 expected genera: (Listeria, Escherichia, Pseudomonas, Enterococcus, Deinococcus, Actinomyces and Acinetobacter)

Heatmap: close similarity with between platforms

6-7 species detected

High percentage of unassigned data and false positive species

Best pipeline for ILL

Good pipeline at species level for ILL when dealing with taxonomic analysis

-p

Bacteroides and Bacteriodaceae not detected by P GM.

Based on RDP

Best TAR

Low divergence (D KL ) and high correlation for both ILL and P GM

Cons

VSEARCH

ro

Pipeline

re

8 species detected

lP

ILL missed Listeria and Escherichia Lowest TDR

Good pipeline for ILL Select this pipeline for P GM except if the target is the gut microbiota (Bacteroides is not detected)

P ipeline for P GM when identification of Bacteroides is needed

ur

Best pipeline for P GM

Jo

Indications

correlation

na

Non-significant P GM

Best accuracy (TAR) Not useful for P GM Not able to detect Listeria and Escherichia This pipeline only for Illumina when you look for high accuracy rate

for

Journal Pre-proof Figure legends Figure 1. Family level bacterial relative abundance for known and observed staggered mock community analyzed with two platforms (Illumina and PGM) and four processing pipelines (QIIME default, QIIME modified, Vsearch and QIIME 2) separated according to operon concentrations. Unassigned: non-identified features; Under-classified: features expected identified only in upper taxa; Misclassified:

ro

of

observed features that were not expected.

Figure 2. Genus level bacterial relative abundance for known and observed staggered

-p

mock community analyzed with two platforms (Illumina and PGM) and four

re

processing pipelines (QIIME default, QIIME modified, Vsearch and QIIME 2)

lP

separated according to operon concentrations. Unassigned: non-identified features; Under-classified: features expected identified only in upper taxa; Misclassified:

na

observed features that were not expected.

ur

Figure 3. Incidence map showing bacteria found at the family, genus and species

Jo

levels (species (X), genus (green), family (blue), non-detected (pink)). Figure 4. Heatmap of relative abundance at family, genus and species level detected by four bioinformatics strategies (QIIME 1 default, QIIME 1 modified, QIIME 2 and Vsearch) and two sequencing platforms (ILL (Illumina MiSeq) and PGM) compared to the known relative abundance (HM-783D) (a, b and c). Figura 5. Heatmap of relative abundance at species level for SPINGO species classifier. Only the 20 expected taxa were included.

Journal Pre-proof Figure 6. (A) Taxon detection rate (TDR) and (B) taxon accuracy rate (TAR) at the family, genus and species levels. TDR was calculated as the number of true positive features divided by the total number of expected features. TAR is the number of true positive features divided by the total number of observed features. ILL = Illumina MiSeq platform; PGM = Ion Personal Genome Machine. Figure 7. Alpha diversity for different platforms-strategies. Alpha diversity indices for

-p

SUPPLEMENTARY MATERIAL

ro

evenness (c), and Shannon diversity index (d).

of

a range of rarified sequence depths: observed OTUs (a), Chao1 index (b), Pielou’s

re

Figure 1s. Family level bacterial relative abundance for known and observed

lP

staggered mock samples with two platforms (Illumina and PGM) and four processing

na

strategies (QIIME default, QIIME modified, Vsearch and QIIME 2). Unassigned: nonidentified features; Under-classified: features expected identified only in upper taxa;

ur

Misclassified: observed features that were not expected.

Jo

Figure 2s. Genus level bacterial relative abundance for known and observed staggered mock community analyzed with two platforms (Illumina and PGM) and four processing pipelines (QIIME default, QIIME modified, Vsearch and QIIME 2). Unassigned: non-identified

features; Under-classified: features expected

identified

only in upper taxa; Misclassified: observed features that were not expected. Figure 3s. Species level bacterial relative abundance for known and observed staggered mock community analyzed with two platforms (Illumina and PGM) by SPINGO

Journal Pre-proof FIGURES Figure 1 a)

b) 0.0400 0.0350 0.0300 0.0250 0.0200 0.0150 0.0100 0.0050 0.0000

Stap hyloco ccaceae

QIIME1

QIIME2

Clostrid ia ceae

of

Streptococcaceae

PGM

ILL

c)

d)

0.0035

MOCK

Pseudomonadacea e

PGM

ro

MOCK

VSEARCH

QIIME1 MOD

QIIME1

QIIME2

ILL

HM783D

VSEARCH

QIIME1 MOD

QIIME1

0.0000

QIIME2

0.1000

Bacillaceae

VSEARCH

Rho dobacteraceae

QIIME1 MOD

0.2000

QIIME1

Enteroba cteriaceae

0.3000

HM783D

0.4000

QIIME2

0.5000

VSEARCH

0.6000

QIIME1 MOD

0.7000

-p

0.0140

0.0030

0.0120

0.0025

0.0100

0.0020

Moraxellaceae

0.0080

Actin omyceta ceae

0.0010

Bacteroidaceae

Helicobacteraceae

0.0060

re

0.0015

PGM

Jo

ur

e)

Listeriaceae

MOCK

QIIME2_PGM

QIIME2

VSEARCH

QIIME1 MOD

QIIME1

QIIME2

VSEARCH ILL

VSEARCH_PG QIIME1 M MOD_PGM

HM783D

0.0000

QIIME1 MOD

QIIME2

VSEARCH

QIIME1 MOD

MOCK

Lacto bacillaceae

0.0020

Enterococca ceae

na

ILL

QIIME1

HM783D

QIIME2

VSEARCH

QIIME1 MOD

QIIME1

0.0000

lP

Deinococcaceae

QIIME1

0.0005

0.0040

Neisseria ceae Propionibacteriaceae

PGM

QIIME1 VSEARCH_IL L MOD_ILL

QIIME2_ILL

QIIME1_ILL

QIIME1_PGM MOCK HM783D

QIIME1_ILL

QIIME2_PG M

QIIME1-ILL QIIME2_ILL

QIIME1 MOD_ILL VSEARCH_P QIIME1 GM QIIME1_PG MOD_PGM M

Unassigned

VSEARCH_IL MOCK L HM783D QIIME2_ILL

QIIME1 MOD_ILL

QIIME2_PG M

VSEARCH_ILL

Missclassified

Underclassified

QIIME1_PG MOCK MHM783D QIIME1 MOD_PGM VSEARCH_P GM

Journal Pre-proof Figure 2 a)

b)

0.3500

0.0350

0.3000

0.0300

0.2500

0.0250

0.2000

0.0200

0.1500

Streptococcus

0.1000

Stap hyloco ccus

0.0500

0.0150

Pseudomonas

0.0100

Clostrid iu m

0.0050

Rho dobacter

0.0000

0.0030 0.0100 Propionibacterium

0.0020

Neisseria

0.0015

0.0060

0.0020

ILL

na

ur Jo QIIME1_PG M

VSEARCH_I LL MOCK HM783D

QIIME2_ILL

QIIME2

VSEARCH

QIIME1 MOD

QIIME1

HM783D

Actin omyces

QIIME2

VSEARCH

QIIME1 MOD

QIIME1

QIIME2

HM783D MOCK

PGM

Underclassified QIIME1-ILL QIIME2_PG M

QIIME1-ILL

QIIME2_PG M

QIIME1 MOD_ILL

QIIME1 MOD_ILL

VSEARCH_P GMQIIME1 MOD_PGM

VSEARCH

Acineto bacter

QIIME1

re

Helicobacter

Missclassified

QIIME1-ILL

QIIME2_PG M

Bacteroides

0.0000

lP

Unassigned

QIIME2

VSEARCH

QIIME1 MOD

QIIME1

HM783D

QIIME2

VSEARCH

QIIME1

QIIME1 MOD

e)

Deinococcus

0.0005

Lacto bacillus

0.0000

Enterococcu s

0.0010

Listeria

PGM

VSEARCH

-p

0.0040

PGM

ro

0.0025

0.0080

MOCK

of

d)

0.0120

MOCK

QIIME1 MOD ILL

PGM

c)

ILL

QIIME2

Bacillus QIIME1

QIIME2

VSEARCH

QIIME1 MOD

MOCK

Escherichia

QIIME1 MOD

ILL

QIIME1

HM783D

QIIME2

VSEARCH

QIIME1 MOD

QIIME1

0.0000

QIIME1 MOD_ILL

VSEARCH_I LL

VSEARCH_ PGM

QIIME2_ILL

QIIME2_ILL

VSEARCH_ PGM QIIME1 MOD_PGM

QIIME1 MOD_PGM VSEARCH_I LL

QIIME1_PG M

MOCK HM783D

QIIME1_PG M

MOCK HM783D

Journal Pre-proof

Q IIME 1 MO D

Q IIME 1 Operons

Mock species

PGM

ILL

PGM

ILL

VSEARCH PGM ILL

Q IIME 2 PGM

ILL

Actinomyces odontolyticus Bacteroides vulgatus 1,000

Deinococcus radiodurans

x

Species

Enterococcus faecalis

Genus

Streptococcus pneumoniae

Family only

x

x

x

x

Propionibacterium acnes

x

x

x

x

x

x

x

x

Bacillus cereus

x

x

x

x

x

x

x

Staphylococcus aureus

x

x

x

-p

Non- detected x

x

x

x

Streptococcus agalactiae

x

x

x

x

x

Escherichia/Shigella coli

x

x

Rhodobacter sphaeroides

x

lP

Acinetobacter baumannii Helicobacter pylori

x

x

x

of

Lactobacillus gasseri

10,000

Listeria monocytogenes

Clostridium beijerinckii 100,000

1,000,000

Staphylococcus epidermidis x

ur Jo

x

x

x

x

x

x

x

x

x

x

x

x

na

Streptococcus mutans

Figure 3

re

Pseudomonas aeruginosa

ro

Neisseria meningitidis

x

Journal Pre-proof Figure 4 a)

ro

of

QIIME2 ILL QIIME1 MOD PGM VSEARCH PGM QIIME2 PGM QIIME1 PGM HM783D QIIME1 MOD ILL QIIME1 ILL VSEARCH ILL

HM783D QIIME1 MOD PGM VSEARCH PGM QIIME2 ILL QIIME2 PGM QIIME1 PGM QIIME1 MOD ILL QIIME1 ILL VSEARCH ILL

c)

Jo

ur

na

lP

re

-p

b)

HM783D VSEARCH PGM QIIME1 ILL QIIME1 MOD ILL QIIME2 ILL QIIME1 MOD PGM QIIME2 PGM QIIME1 PGM VSEARCH ILL

Journal Pre-proof Figure 5

Jo

ur

na

lP

re

-p

ro

of

HM783D SPINGO PGM SPINGO ILL

0.00

FAMILY GENUS

ILL

PGM SPECIES

QIIME 2

VSEARCH

QIIME 1 MOD

QIIME 1

ILL

of QIIME 2

VSEARCH

QIIME 1 MOD

QIIME 1

QIIME 2

ro

VSEARCH

QIIME 1 MOD

QIIME 1

QIIME 2

VSEARCH

QIIME 1 MOD

QIIME 1

0.00

QIIME 2

GENUS

-p

re

FAMILY

VSEARCH

b)

QIIME 1 MOD

TAR

lP

0.80

QIIME 1

na

0.60

QIIME 2

VSEARCH

QIIME 1 MOD

0.40

ur

0.20

Jo

QIIME 1

Journal Pre-proof

Figure 6

a)

TDR

1.00

0.80

0.60

0.40

0.20

SPECIES

PGM

Journal Pre-proof Figure 7

of

a)

ur Jo

c)

na

lP

re

-p

ro

b)

Journal Pre-proof

Jo

ur

na

lP

re

-p

ro

of

d)

Journal Pre-proof SUPPLEMENTARY MATERIAL Figure 1s 100%

Actinomycetaceae Propionibacteriaceae

90%

Bacteroidaceae Bacillaceae

80%

Listeriaceae Staphylococcaceae

70%

of

Enterococcaceae

Lactobacillaceae Streptococcaceae

ro

60%

50%

Rhodobacteraceae

-p

40%

Clostridiaceae

Helicobacteraceae

re

30%

Neisseriaceae

Moraxellaceae

lP

20%

Enterobacteriaceae

0%

QIIME1 MO D VSEARCH

QIIME2

ur

ILL

Jo

QIIME1

na

10%

Pseudomonadaceae Deinococcaceae Unassigned Underclassified

HM783D MOCK

QIIME1

QIIME1 MO D VSEARCH PGM

QIIME2

Missclassified

Journal Pre-proof Figure 2s

100% 90% 80% 70%

60%

of

50% 40%

ro

30%

-p

20% 10%

re

0%

lP

QIIME1 QIIME1 VSEARCH QIIME2 HM783D QIIME1 QIIME1 VSEARCH QIIME2 MOD MOD

MOCK

Jo

ur

na

ILL

PGM

Acinetobacter Actinomyces Bacillus Bacteroides Clostridium Deinococcus Enterococcus Escherichia Helicobacter Lactobacillus Listeria Neisseria Propionibacterium Pseudomonas Rhodobacter Staphylococcus Streptococcus Underclassified Missclassified Unassigned

Journal Pre-proof Figure 3s 0.25

0.03 0.025

0.2

0.02

0.15 0.015 0.1

0.01

0.05

0.005

0 E scherichia c ol i

Rhodobac te r sphae roi de s

SPINGO_ PGM

Staphy lococcus e pi de rmidis

SPINGO_ILL

Bacill us cere us

HM-783D

SPINGO_ PGM

Pseudomonas aeruginosa

0.0012

0.005

0.001

0.004

0.0008

0.003

0.0006

0.002

0.0004

0.001

0.0002

Staphy lococcus aureus

SPINGO_ILL

Stre ptococcus agalactiae

HM-783D

He lic obacter pyl ori

Lis te ria monoc ytoge nes

Ne isse ria me ningitidis

Propionibac te rium acnes

SPINGO_ILL

HM-783D

Jo

ur

na

lP

SPINGO_ PGM

Lactobaci llus gas seri

0

re

Aci ne tobacter baumannii

-p

ro

0.006

0

Clostri di um beij erinck ii

Stre ptococcus mutans

of

0

Actinomyc es odontol yticus

Bacteroides vulgatus

SPINGO_ PGM

De inococ cus radi odurans

E nte rococc us fae calis

SPINGO_ILL

Stre ptococcus pne umoni ae

HM-783D

Journal Pre-proof Highlights

ur

na

lP

re

-p

ro

of

16S rRNA short-read sequence data gene presents limitations at species level. For metagenomic analysis up to genus level, most classifiers perform well. SPINGO classifier is useful to address the limitations at species level. For PGM data, the optimal pipeline was QIIME 1.9.1 with default parameters. All the protocols implemented in QIIME performed well for Illumina data.

Jo

    