Journal Pre-proof Bioinformatic strategies to address limitations of 16rRNA shortread amplicons from different sequencing platforms
María Bailén, Carlo Bressa, Mar Larrosa, Rocío González-Soltero PII:
S0167-7012(19)30784-5
DOI:
https://doi.org/10.1016/j.mimet.2019.105811
Reference:
MIMET 105811
To appear in:
Journal of Microbiological Methods
Received date:
9 September 2019
Revised date:
14 December 2019
Accepted date:
15 December 2019
Please cite this article as: M. Bailén, C. Bressa, M. Larrosa, et al., Bioinformatic strategies to address limitations of 16rRNA short-read amplicons from different sequencing platforms, Journal of Microbiological Methods (2019), https://doi.org/10.1016/ j.mimet.2019.105811
This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
© 2019 Published by Elsevier.
Journal Pre-proof Bioinformatic strategies to address limitations of 16rRNA short-read amplicons from different sequencing platforms María Bailéna, Carlo Bressaa, Mar Larrosaa, Rocío González-Solteroa# a
Universidad Europea de Madrid, Faculty of Biomedical and Health Sciences,
-p
Corresponding author: Rocío González Soltero
ro
of
Nutrition, Microbiota and Health group, Villaviciosa de Odón, Madrid 28670, Spain.
re
[email protected]
lP
Keywords: mock, NGS platforms, metagenomic analysis pipelines, bacterial
Jo
ur
na
community, MiSeq, PGM.
Journal Pre-proof ABSTRACT: Sequencing the 16S gene rRNA has become a popular method when identifying bacterial communities. However, recent studies address differences in the characterization based on sample preparation, sequencing platforms, and data analysis. In this work, we tested some of the available user-friendly protocols for data analysis with the reads obtained from the sequencing machines Illumina MiSeq and Ion Torrent Personal Genome Machine (PGM). We sought for the advantages and disadvantages
of
of both platforms in terms of accuracy, detected species, and abundance, analyzing a
ro
staggered mock community. Four different pipelines were applied: QIIME 1.9.1 with
-p
default parameters, QIIME 1.9.1 with modified parameters and chimera removal,
re
VSEARCH 2.3.4, and QIIME 2 v.2018.2. To address the limitations of species level detection we used species-classifier SPINGO. The optimal pipeline for PGM platform,
lP
was the use of QIIME 1.9.1 with default parameters (QIIME1), except when a study
na
requires the detection of Bacteroides or other Bacteroidaceae members, in which QIIME1MOD (with chimera removal) seems to be a good alternative. For Illumina
ur
Miseq, VSEARCH strategy can be a good option. Our results also confirm that all the
Jo
tested pipelines can be used for metagenomic analysis at family and genus level.
Journal Pre-proof 1. Introduction 1.1. Limitations of 16S rRNA short-read amplicons sequencing Sequencing of the 16S rRNA gene is the most popular method for characterizing the complexity
of microbial communities.
The
latest
advances in high-throughput
sequencing techniques and next-generation (NGS) sequencing platforms allow users to assign taxonomy considering the hypervariable regions contained in the 16S rRNA
of
gene. Studies suggest that 16S rRNA gene sequencing provides genus identification in
ro
most cases but less so with regard to species (Janda and Abbott, 2007). It has been
-p
pointed out that a key limitation of using 16S rRNA gene sequencing is that related
re
bacterial species may be indistinguishable due to the homology between sequences
lP
(Plummer et al., 2015). Sequencing of hypervariable sub-regions of this gene produces single short-reads of 100-500 nucleotides, when the gene contains 1542 base pairs, variations out of sequencing area (Zhang et al., 2011). Although some sub-
na
missing
regions provide a good approximation of 16S diversity, most do not distinguish
ur
polymorphisms between closely related taxa, which could be avoided by the complete
Jo
sequencing of the gene (Johnson et al., 2019). Also, the presence of multiple intragenomic copies of this gene (1 up to 5 or more), which could carry subtle nucleotide substitutions,
may overestimate microbial abundance (Johnson et al., 2019;
Větrovský and Baldrian, 2013). 1.2. Comparison of platforms using short-reading lengths Another source of bias when selecting sequencing platforms and the subsequent data processing is the software of choice and the analytical protocol (pipeline). It has been previously reported that the employment of different platforms can yield differences in
Journal Pre-proof sample characterization depending on their accuracy, sensitivity and error rate (Claesson et al., 2017). Up to date, no standard evaluations studies exist to select the best
platform in terms of accuracy or sensitivity when evaluating complex
communities. Escobar-Zepeda et al., have recently publish a benchmark tool that looks into error rates and sensitivity (Escobar-Zepeda et al., 2018). Loman and co-workers showed that Illumina MiSeq yielded the lowest error rate when compared with both
of
Ion Torrent Personal Genome Machine (PGM) and the Roche 454 platform (Loman et
ro
al., 2012). Nowadays, the Roche 454 has been discontinued whereas PGM and
-p
Illumina MiSeq bench-top versions are the platforms of election for researchers
re
starting microbiota characterization projects, and the prevalence of Illumina over PGM has significantly increased during the last years (Escobar-Zepeda et al., 2015). There
lP
have been some improvements on the PGM platform with the availability of new
na
sequencing kits and chips that decrease the lower indel error rates, but the Illumina MiSeq platform seems to be better avoiding indel errors (Jünemann et al., 2013). The
ur
Illumina MiSeq technology uses an approach of sequencing by synthesis, utilizing
Jo
fluorescently labeled reversible-terminator nucleotides and, depending on the model, it takes several days to obtain the data (Pylro et al., 2014). It is probably the most popular sequencing platform, in view of its competitive cost, lower than in the case of the PGM machine (Escobar-Zepeda et al., 2015). For Illumina MiSeq, the hypervariable regions V3-V4 are the most used for microbiota characterization analysis. The amplicon is generated by paired-ends reads covering around 460 nucleotides and allowing a certain degree of overlap in the readings at its 3'end, the area of greatest loss of quality (Barb et al., 2016; Claesson et al., 2010; Klindworth et al., 2013; Quail et al., 2012).
Journal Pre-proof Considering the sequencing protocols, the PGM platform seems to be the best for characterization of hypervariable regions (V1-V9), due to the sequencing of all the regions at the same time by the most recent versions of their chips and larger singleread sequences (Barb et al., 2016). Due to this coverage, Quail et al. found that the performance of the PGM machine was better, but still at the expense of a higher false positive rate (Quail et al., 2012).
of
1.3. Comparative analysis of pipelines for microbiota data
ro
Analytical pipelines applied to data processing can also introduce bias factors, as not
-p
all available pipelines include the same steps (Allali et al., 2017). For a non-expert
re
bioinformatician, to start a microbial community analysis project requires some initial input into the available algorithms, software and pipelines. There is limited literature
lP
comparing the advantages and disadvantages of available software packages. The
na
choice depends on the level of taxonomic depth and complexity to be achieved in a study. Nilakanta et al., concluded that QIIME and Mothur where the most
ur
comprehensive pipelines and both include features and support documentation
Jo
(Nilakanta et al., 2014). D’Argenio et al. did not find statistically significant differences in the taxonomic assignment and alpha diversity measures when using MG-RAST and QIIME, but stated that QIIME produced more accurate assignments (D’Argenio et al., 2014). QIIME1 protocols were initially established for 454 pyrosequencing raw data and later adapted to Illumina technology (Caporaso et al., 2012). Later, Bokulich 2013 presented new guidelines trying to improve default parameters for quality filtering of Illumina reads (Bokulich et al., 2013). Now QIIME is under the transition to QIIME2, where important changes have been made in terms
Journal Pre-proof of overrepresentation of OTUs, characterization (now “features”) and other quality related factors (Bolyen et al., 2018). VSEARCH is an open source alternative to the USEARCH tool and can be implemented in the QIIME package. It includes chimera detection, paired-end reads merging and dereplication (Rognes et al., 2016) and was designed as an alternative to the widely used USEARCH tool (Edgar, 2010) for which the source code is not free.
of
SPINGO is a novel method to annotate sequences at genus or species level (Allard et
ro
al., 2015). Considering the results reported by Allard et al., and more recently by
-p
Escobar-Zepeda the use of SPINGO based on the RDP database seems a good method
re
to report high accuracy at the species level (Bokulich et al., 2013; Escobar-Zepeda et al., 2018). Escobar-Zepeda considered also the use of different databases concluding
lP
small, but more curated databases, like RDP improved and their results in terms of
na
sensitivity, specificity and accuracy (Escobar-Zepeda et al., 2018). After all those studies, there is not yet a well-established protocol to decide which
ur
platform adapts better to specific research topics and there are not defined protocols
Jo
considering all the sources of constraints detailed above, especially for novices in the field. The main objective of this study was to analyze a known bacterial community (mock) in a pilot study using two common benchtop sequencing platforms that are currently on the market and using four different analysis pipelines: QIIME1, QIIME1MOD, VSEARCH and QIIME2. The use of a standardized mock allows to measure directly the sequencing and analytical methods without considering the bias introduced by sample preparation. Also, to increase taxonomic resolution at species level of the studied mock community, the species classifier SPINGO was used.
Journal Pre-proof
2. Material and methods 2.1. Mock microbial community Microbial Mock Community B (HM-783D, staggered, low concentration, v5.2L, for 16SrRNA gene sequencing) contained known operons counts of 20 species as per which can be obtained from BEI (Manassas,
of
Certificate of Analysis for HM-783D,
ro
Virginia, U.S.A.)(BEI Resources, 2014), including Gram-positive and Gram-negative
-p
bacteria. The sample contained a staggered number of ribosomal RNA operon counts
re
from 103 to 106 .
na
2.2.1. Illumina MiSeq Analysis
lP
2.2. Sequencing
Bacterial 16S rRNA gene analysis was performed by amplifying the hypervariable
ur
regions V3-V4 of the 16s rRNA gene using the primer pair:
Jo
5’-TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG-3' and 5'- 165 GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAG-3’ (Klindworth et al., 2013).
Amplification was done with a GeneAmp PCR System 9700 thermocycler (Applied BioSystems, Foster City, CA, USA). PCR reactions were carried out following the manufacturer’s instructions. PCR products were analyzed in a 1% agarose gel. Before sequencing, amplicons were purified with the MinElute Gel extraction kit (Qiagen, Hilden, Germany). For sequencing, the Illumina MiSeq version 3 (600 cycles) platform was used (Illumina, San Diego, CA) (Claesson et al., 2010).
Journal Pre-proof 2.2.2. Ion Torrent PGM Analysis Similarly, 16S region was amplified with Ion 16S T M Metagenomics Kit (Life Technologies, Spain) by two separated PCR reactions using primer sets: V2–4–8 and V3, V6-7 y 9. The primer sets: V2–4–8 and V3, V6-7 and 9 refers to as primer set 1, which includes primers for amplification of regions: v2, v4 and v8 (V2–4–8), and primer set 2 which includes primers for amplification of regions: v3, v 6 and v7 in a
of
single amplicon, and v9. Equal volumes of the V2, V4, V8, V3, V6-7, and V9
ro
amplifications reactions were combined. Obtained amplicons were processed to make
-p
the DNA library using the KAPA Library Preparation Kit for Ion Torrent Platforms.
re
Library quantification was carried out by Qubit dsDNA Assay Kit. Library quality was checked by agarose gel electrophoresis. PCRs were done following manufacturer’s
lP
indications as previously reported by Barb et al., 2016. Sequences were purified using
na
Agencourt AMPure XP Reagent for mock and Axy Prep PCR Clean-up. Emulsion PCRs were analyzed using the Ion PGM Hi Q Template kit according to the
ur
manufacturer’s instructions. Sequencing of the amplicon library was carried out on an
Jo
Ion 314T M Chip on the Ion Personal Genome Machine (PGM). After sequencing, all reads were filtered to remove those of low quality. Multiplex PCR reactions were subsequently combined and further processed as a single sample containing various amplicons. Demultiplexing as well as barcode trimming and basic quality trimming occur by default at the end of sequencing run and are performed still on the sequencer. 2.3. Bioinformatics The software QIIME (Quantitative Insights Into Microbial Ecology) was used to assess the platform comparison (Caporaso et al., 2010; Rideout et al., 2018). We
Journal Pre-proof compared four different data analyses strategies or pipelines to find the one that best fit our mock composition. The objective is to use the mock as baseline to test running conditions and to select the better procedure with the rest of the samples. Considering that these days there is a transition from QIIME to QIIME2, both the QIIME v1.9 and QIIME 2 v18.2, April 2018, were applied to compare the obtained results (Table 1). We also used VSEARCH, the new pipeline of the BMP (Brazilian Microbiome
of
Project) that is available online (http://www.brmicrobiome.org/) including protocols
ro
and guidelines for analyzing data coming from the two platforms (Illumina MiSeq and
-p
PGM).
re
Sequences were clustered into bins called ‘Operational Taxonomic Units’ (OTUs) based upon similarity. This similarity between a pair of sequences is computed as the
lP
percentage of sites that agree in a pairwise sequence alignment. A common similarity
na
threshold used is 97% (Nguyen et al., 2016).
For all these pipelines, the taxonomic classification was conducted using Greengenes
ur
13_8 97%, which seems accurate to genus level but loses specificity at the species
Jo
level (Yarza et al., 2014).
The two Illumina sequencing runs (paired-end reads) were assembled with fastq format using the script join_paired_ends.py (Aronesty, 2013). Single-ends PGM reads were directly introduced in the pipeline as an individual sequence. Subsequently, the data analysis was done as follows and the main scripts and commands were summarized in Table 1: a) Pipeline 1 (QIIME default parameters; QIIME v1.9.1 (Pylro et al., 2014)). The output file was pre-processed for quality filtering with Q20 as the minimum
Journal Pre-proof Phred quality score (q=19), using the QIIME script (split_libraries_fastq.py) (Bokulich et al., 2013; Caporaso et al., 2012). OTUs were assigned using openreference OTU picking protocol (pick_open_reference_otus.py) which also runs
assign_taxonomy.py
with the UCLUST algorithm as default and
Greengenes13_8 as reference database. During this step sequences were clustered into OTUs at 97% similarity.
of
b) Pipeline 2 (QIIME modified parameters; QIIME1_MOD (Edgar et al., 2011)).
ro
This strategy comprises as upstream following data pretreatment as for pipeline
-p
1 (split_libraries_fastq.py; quality filtering Q20). The downstream analysis
re
follows as indicated in Table 1: OTU picking through pick_otus.py->selection of representative sequences->Chimera detection using the Chimera Slayer
lP
method->taxonomic assignments using the UCLUST method (Edgar, 2010).
na
c) Pipeline 3 (Pipeline 2017 Pylro). This pipeline uses VSEARCH, an updated version of the USEARCH pipeline available in the web of BMP (version
ur
November 2017) (Pylro et al., 2016, 2014; Rognes et al., 2016). Data
Jo
pretreatment includes the same script (split_libraries_fastq.py) but, in the case of Illumina data, filters were turned off (Phred quality score q=0) and default parameters applied for PGM data analysis. A fastq_maxee 1.0 was also applied to data and later, singletons were discarded before the OTU assignment. OTU clustering method was UPARSE (-id 0.97). The taxonomic assignment for Illumina data was done as indicated in the pipeline 1. d) Pipeline 4 (QIIME2 v.2018.2): raw sequence data were imported into QIIME 2 software using the QIIME2 command-line interphase (q2cli) to import data with the QIIME tools import method. The data were trimmed, truncated,
Journal Pre-proof quality filtered, dereplicated, and chimeras removed with the QIIME2 plugin DADA2 with p-max-ee 1.0 (denoised-paired for Illumina and denoise-single for PGM) (Callahan et al., 2016). DADA2 is used for modeling and correcting amplicon errors and includes quality filtering, dereplication, and chimera identification. Representative sequence sets for each DADA2 sequence variant were used for taxonomic classification (QIIME 2 plugin feature-classifier
of
(Bokulich et al., 2018) with the sklearn method (Pedregosa et al., 2011). The
ro
above pipelines (1-4) used the Greengenes v.13_8 database for taxonomic
-p
assignment.
re
For diversity analysis with the tested pipelines alpha-rarefaction plugin (implemented
lP
in QIIME2) was used (Bolyen et al., 2018).
To address these limitations and obtain a higher taxonomic resolution, we used
na
species-classifier SPINGO that uses a custom database, based on RDP (Ribosomal Database Project). This software application can classify assignable species sampled
ur
form any environment. Quality filtering of paired-ends and single-end reads with
Jo
SPINGO was done as in the pipelines 1 and 3 with Q20. A reads.fa file was created with the filtered sequences and then classified by using the species classifier SPINGO version 1.3 64 bits with default parameters and its own database based on RDP (Allard et al., 2015). 2.4. Statistical comparison of pipelines and sequencing platforms To compare the performance of each pipeline with data from the two different sequencing platforms, taxonomic summary files were compared with the known staggered mock sample by computing Spearman’s rank correlation coefficient and
Journal Pre-proof Kullback-Lieber
divergence.
Kullback-Lieber
divergence
(DKL)
measures
the
difference between two probability distributions, where a larger DKL indicates a greater divergence between distributions (Kullback and Leibler, 1951) whereas the Spearman’s coefficient is used to discover the strength of a link between two data sets. Also, Wilcoxon signed-rank test was used to evaluate statistical differences between observed and expected taxa at family, genus and species level, and contingency
of
analysis was used to evaluate differences between true positive and false positive taxa
ro
(Table 2).
-p
The parameters taxon accuracy rate (TAR) and taxon detection rate (TDR), which are
re
useful alternatives, rely on the presence/absence of expected taxa calculated. TDR is the number of true positive features divided by the total number of expected features,
lP
and TAR is the number of true positive features divided by the total number of
na
observed features. Considering taxa abundance unassigned, under-classified and misclassified taxa were used. Unassigned are non- identified features, under-classified
ur
are features expected identified only in upper taxa whilst misclassified taxa are
Jo
observed features that were not expected (Bokulich et al., 2018). Alpha diversity metrics are widely used for the characterization of communities. It has two components, species richness and equitability indices. To study species richness, rarefaction plots of observed OTUs (OTUs actually observed in samples) and chao1 (OTUs expected in samples) were used. For equitability, Pielou’s evenness studies the uniformity of the population size of each of the species (Jost, 2010). Shannon diversity estimates both species richness and evenness. This index is commonly used to characterize species diversity and assumes that all species are represented in a sample if they are randomly sampled (Kim et al., 2017).
Journal Pre-proof
3.
RESULTS
3.1. Benchmarking different analytical protocols using a mock community:
of
advantages and disadvantages
ro
Mock communities are useful for method benchmarking as the actual composition of a
The sequenced mock sample provides a
re
community profiling with accuracy.
-p
mock community is known in advance and allows quantitative assessments of
theoretical community to be compared to the observed after sequencing. The genomic
lP
DNA from the microbial Mock Community HM-783D contains twenty individual
na
species with operons ranging from 1,000 to 1,000,000 (Staggered), being five species at the lowest operon level: Actinomyces odontolitycus, Bacteroides vulgatus,
ur
Deinococcus radiodurans, Enterococcus faecalis, and Streptococcus pneumoniae.
Jo
Considering the expected and detected bacteria, we compared the results at family, genus and species level after sequencing the mock sample in the PGM and Illumina MiSeq v3 (600 cycles) machines (Table 2). At family level, all pipelines for both platforms detected the 17 expected families, except for pipelines QIIME1-PGM, QIIME2-ILL
and
QIIME2-PGM,
missing
Bacteriodaceae,
Listeriaceae
and
Enterococcaceae respectively (Table 2 and Figs. 1 and 1s). Overestimated families by most of the pipelines were Bacteroidaceae, Helicobacteriaceae, Enterobacteriaceae and Pseudomonaceae. The under-classified OTUs (expected features identified only in upper taxa) were reduced drastically at family level compared to genus and species.
Journal Pre-proof QIIME 2-PGM, presented the highest percentage of unassigned and under-classified OTUS with no misclassifications (observed features that were not expected) at family level. Also, analysis of data reported by Illumina produce higher percentages of misclassifications than data from PGM (Fig. 1e). At genus level, Escherichia was only detected by QIIME1 (both platforms) and QIIME1-ILL was the only strategy able to detect the 17 expected genera. Other not detected genera were Bacteroides by
of
QIIME1-PGM, Listeria by QIIME2-ILL and Enterococcus by QIIME2-PGM (Table
ro
2, Figure 2). There were no unassigned OTUs for QIIME 2-Illumina and considering
-p
the misclassified and under-classified OTUs QIIME 2 data presented the highest
re
percentages as it was seen at family level (Fig. 2e).
Finally, at species level, only 25-40% species (4-8 species) were identified by the
lP
different bioinformatic strategies (Table 2).
na
Considering the false positive (FP) and false negative (FN) taxa, QIIME 2 produce the highest FN and the lowest FP taxa at all levels as opposed to QIIME1-ILL with the
ur
highest FP and lowest FN of all the tested strategies (Table 2). Contingency analysis
Jo
did not find statistical differences between false positive and false negative taxa at each of the studied pipelines. To compare the taxonomic identification performed at family, genus and species levels, OTU counts were summarized in an incidence map (Fig. 4), and the 20 species presented in the mock sample were classified according to the operon count. Any OTU not classified up to the family level was labelled as “non-detected.” Our results showed that Helicobacter pylori and Propionibacterium acnes were the only two species detected by all strategies/platforms, with an operon count of 105 for both species.
Also,
Bacillus cereus,
Staphylococcus aureus, Staphylococcus
Journal Pre-proof epidermidis,
Streptococcus
agalactiae,
and
Rhodobacter sphaeroides,
which
appeared at high operons counts (105 -106 ), were detected by most of the strategies/ platforms (Fig. 2) at species level. Limitations in detection were found for Escherichia coli that was only detected at the species level by QIIME1 (rest of pipelines
only
difficulties
achieve
detecting
family
level),
Enterococcus
QIIME1-PGM
faecalis
and
(from PGM
QIIME2 data)
and
showed Listeria
of
monocytogenes (from Illumina data) at family, genus and species levels. The same
ro
was observed in the case of Bacteroides vulgatus following the QIIME1-PGM
-p
(Figure 3).
re
To quantify the distribution of observed and expected taxa, the DKL was calculated at family, genus and species levels. In addition, Spearman's rank correlation coefficient
lP
was calculated between observed and expected taxa (Table 3). At family level, the
na
lowest DKL were VSEARCH and QIIME1, in the case of Illumina and QIIME1 for PGM. At genus level, the lowest DKL was obtained by QIIME1 in both Illumina and
ur
PGM. At species level, divergence range from 8 to 16.
Jo
Results of correlation were statistically significant at family and genus level but not for species being QIIME1-PGM and VSEARCH-ILL, the both family and genus. Moreover, Wilcoxon signed-rank test was used to evaluate the differences between observed and expected taxa without statistical differences at family and genus level. Only at species level we found statistical differences, except for QIIME1-PGM (data not shown). Finally, a heatmap of relative abundance at family, genus and species level was generated (Fig. 4). Only strategies using Illumina were clustered at family level with
Journal Pre-proof the expected mock, except for QIIME2 (Fig. 4a). At genus and species level (Fig 4b and 4c), none of the strategies were clustered with the mock. 3.2. Accuracy and Sensitivity: Taxon Detection Rate and Taxon Accuracy Rate For mock sample characterizations, Taxon Accuracy Rate (TAR) has been used to evaluate accuracy in taxonomy classifications, and Taxon Detection Rate (TDR) as an approximation
for
sensitivity
measures
and
provide comparable
results when
of
analyzing mock communities (Bokulich et al., 2018). All the pipelines showed high
ro
TDR score, being QIIME2 the one with the lowest TDR in the two platforms (Fig. 5a).
-p
On the contrary, QIIME2 reported the highest accuracy (TAR) at all levels (Fig. 5b).
re
3.3. Species identification by the species classifier SPINGO Identification of OTUs to species level is quite relevant when studying microbiota, as
lP
some species from the same genus may be pathogenic and other commensals. To
na
increase taxonomic resolution at species level of the mock community studied we have used Species classifier SPINGO. The analyses of the mock community by SPINGO
Jo
Illumina (Table 4).
ur
detected 90-100% of the expected species, 18 from the PGM platform and 20 from
Considering the detectability threshold, SPINGO-ILL could detect the five species at low operon levels, and SPINGO-PGM only three, where neither Streptococcus pneumoniae
nor Enterococcus faecalis
were detected.
Three species were
overestimated, Helicobacter pylori and Pseudomonas aeruginosa for ILL and Bacteroides vulgatus by PGM (Fig. 3s). A heatmap of relative abundance at species level for SPINGO is shown at Fig. 5, where a unique cluster between the two pipelines can be seen. Wilcoxon signed-rank test reported statistical differences between SPINGO-PGM but not for SPINGO-ILL.
Journal Pre-proof 3.4. Comparing technologies using Alpha diversity approaches As the sequencing depth can affect alpha diversity measures, analyses were done using rarefaction curves, and the number of reads was standardized for all samples and platforms. Rarefaction analyses were performed at a maximum sampling depth of 29,421 reads. Rarefaction curves of observed species began to plateau, indicating enough sequencing depth. To compare how diversity and evenness of the observed
of
data match the expected data, rarefaction of observed species, Chao1 index, Pielou’s
ro
evenness and the Shannon diversity metrics were performed (Fig. 7).
-p
The highest richness (chao1 and observed OTUs) was observed by QIIME1 MOD-
re
PGM, followed by QIIME1-ILL, and lowest with QIIME 2-ILL (Fig. 7 a and b).
lP
Conversely, results of equitability by Pielou`s and Shannon index, indicated higher alpha diversity for QIIME2-PGM, and lowest for QIIME1 MOD-ILL. Strategies using
na
data from PGM presented a higher evenness and Shannon index than those for Illumina (Fig. 7 c,d).
ur
Increased diversity was observed by PGM, and regarding the pipelines, by QIIME2-
Jo
PMG, when both species richness and evenness were considered as reported by Shannon diversity index. 4. DISCUSSION One of the most important decisions a novice researcher must take before starting a project based on microbiota characterization is to determine the sequencing platform and how the upcoming data will be analyzed, using the available wrapped protocols. The goal of this study was to identify key parameters to define a bioinformatics strategy for analyzing data coming from short-reads sequences of the hypervariable
Journal Pre-proof regions of the 16S rRNA gene, obtained from two different sequencing platforms, PGM and MySeq. Different approaches were used to characterize a mock community (low diversity sample). Mock communities contain well-characterized species and can be useful for novice users to test their results and become familiar with the protocols. 16s RNA gene sequencing, albeit with some limitations such as multiple gene copies (Carlos et al., 2012; Větrovský and Baldrian, 2013), is the most used tool to identify
of
bacterial communities, but frequently limited to genus level of identification (Johnson
ro
et al., 2019; Noecker et al., 2017). If the objective of the study is characterization at
-p
the species level, then short-read sequencing might not be the best solution, as
re
evidenced by our statistical analysis between expected and observed species (Table 2). These results agree with those recently reported by Johnson et al. (2019). It has also
lP
recently been reported that incorporating environment-specific taxonomic abundance
na
information increases species level classification accuracy across common sample types (Kaehler et al., 2018).
ur
The platform of choice sometimes depends on budget or availability, there being a
Jo
considerable difference in cost between the second and third-generation sequencing platforms currently on the market. In our case, we compared Illumina MiSeq and PGM platforms (second-generation with a lower cost). Regarding the software, there is limited literature comparing the advantages and disadvantages of available software packages (D’Argenio et al., 2014; Nilakanta et al., 2014). For a non-experienced bioinformatician, to start a microbiota project requires some insights in the available methods. When starting our project, we selected the QIIME toolkit as is open-source and has an enormous network of users. Also, the
Journal Pre-proof developers use to provide fast feedback to the users when using their code (https://forum.qiime2.org/). It matters to say that during our study this software was undergoing a transition to QIIME2 (QIIME1 has been out of service since January 2018) (Bolyen et al., 2018). Considering this situation, we used both, the old and the newer version. The reported results were not as comparable as they should be. The transition to QIIME 2 also changes the well-known system of OTUs classification.
of
The new taxonomic unit assignment method is known as “feature classifier”. QIIME
ro
protocols were initially established for 454 pyrosequencing raw data and adapted to
-p
Illumina technology due to an increase in the use of the Illumina sequencing machine
re
(Caporaso et al., 2012). Later, Bokulich presented new guidelines trying to improve default parameters for quality filtering of Illumina reads (Bokulich et al., 2013), being
lP
some of them recently discussed by Edgar, (2017). Due to that, we include VSEARCH
na
(the free version of USEARCH) in our analysis. Our results reported that when using QIIME 2, fewer taxa were identified (low TDR
ur
and richness), but with higher accuracy (high TAR). The missing information when
Jo
using QIIME2 may be related to the use of DADA2 denoising methods, not comparable to quality filtering methods. DADA2 denoising is specialized in reducing the presence of false positives and could make difficult to detect low operon count species. DADA2 seems to be a good choice for assigning identity with high accuracy using high-quality data. Regarding the PGM platform, the use of QIIME 1 seemed to be the optimal pipeline in terms of correlation and divergence, except for analysis requiring the detection of Bacteroides or other Bacteroidaceae members, as gut microbiota studies. In these
Journal Pre-proof specific situations, QIIME1MOD (with chimera removal) seems to be a good alternative (Allali et al., 2017).Concerning Illumina MiSeq, VSEARCH can be a good option as it showed low divergence and high correlation at the family and genus levels. Similar results were reported by QIIME1. Most microbiota characterization studies have the genus identification level as a target. We must point out that VSEARCH did not detect the genus Escherichia. Something
of
similar happened in the case of QIIME2, were neither Escherichia, Enterococcus nor
observed,
particularly
in
the
classification
of members
from the
-p
previously
ro
Listeria were detected. Differences at the family and genus levels have been
re
Enterobacteriaceae family (Barb et al., 2016). The characterization at species level of identification is key when analyzing clinical samples where distinguishing between
lP
two species in one genus can be important in terms of clinical decisions (Claesson et
na
al., 2017). In Clinical Microbiology, the use of PGM, based on ion semiconductor technology approach that reports outputs with run times of two hours, could provide a
ur
good advantage (Claesson et al., 2017). However, as showed in this study, PGM
Jo
presents limitation in characterizing the following species Bacteroides vulgatus or Enterococcus faecalis at any level (Fig. 3). Overall, albeit considering these limitations, clinical identification based on 16S rRNA gene shows a genus-level concordance rate of 96% and a species-level concordance rate of 87.5% (Srinivasan et al., 2015). It has recently been reported that the use of third-generation sequencing platforms such as PacBio CGC targeting the full-length gene sequence of the 16S rRNA gene could improve the identification rate at species level (Johnson et al., 2019).
Journal Pre-proof In labs with the abovementioned technology reporting data from short-read sequencing platforms, identification at species level can be improved when using appropriate software such as SPINGO. SPINGO has been reported to be a standalone, rapid, and accurate classification for samples from different environments. Recently, EscobarCepeda et al., reported that SPINGO was the method with the highest accuracy at species level when using RDP database (Escobar-Zepeda et al., 2018). Although with
of
some restrictions in the relative abundance calculation due to the problems discussed,
ro
second-generation NGS can still be a good strategy when comparing two different
-p
populations (patients vs. controls; treated vs. untreated, etc.). In our study, SPINGO
re
detected 90-100% of the expected species, also using RDP in data from Illumina MiSeq.
lP
Using the results of our studies, and in order to provide a benchmark analysis guide for
na
non-professional bioinformatician research, Table 5 contains the “pros”, “cons”, and “indications” of the two sequencing platforms and analytical pipelines.
ur
To conclude, no statistical differences were found between observed and expected taxa
Jo
up to genus level, albeit with small variations, indicating that all of them can be used for metagenomic analysis at this level. However, we may consider individual limitations in the identification of specific genus and many restrictions at species level, SPINGO being the best approach at this level. Presumably, the sequencing of fulllength 16S intragenomic copy variants has the potential to provide taxonomic resolution of bacterial communities at species and strain level. Authors’ contribution
Journal Pre-proof MB, CB and RGS design the experiments. CB perform the experiments. MB, CB, ML and RGS analyze the data. MB and RGS wrote the paper. CB and ML reviewed the paper and make substantial contributions to the final version. Data availability The raw sequencing data can be requested by researchers and will be available after
of
the deposition in the NCBI public database shortly.
ro
Acknowledgments:
-p
This research was supported by the Ramón y Cajal grant 2012_11910 and the
re
AGL2016-77288-R project from the Ministry of Economy and Competitiveness, Spain, the “Cátedra ASISA” project 2015/UEM46, and the 2013/UEM19 and projects from the Universidad
lP
2015/UEM06
Europea de Madrid.
We also
ur
STABvida, Portugal.
na
acknowledge the services of Parque Científico de Madrid, Spain and the company,
Jo
Conflict of Interest Statement
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest. Nº Secuencia SUB6291805 https://submit.ncbi.nlm.nih.gov/subs/sra/SUB6291805/files REFERENCES Allali, I., Arnold, J.W., Roach, J., Cadenas, M.B., Butz, N., Hassan, H.M., Koci, M., Ballou, A., Mendoza, M., Ali, R., Azcarate-Peril, M.A., 2017. A comparison of sequencing platforms and bioinformatics pipelines for compositional analysis of
Journal Pre-proof the gut microbiome. BMC Microbiol. 17, 1–16. https://doi.org/10.1186/s12866017-1101-8 Allard, G., Ryan, F.J., Jeffery, I.B., Claesson, M.J., 2015. SPINGO: A rapid speciesclassifier for microbial amplicon sequences. BMC Bioinformatics 16, 1–8. https://doi.org/10.1186/s12859-015-0747-1
of
Aronesty, E., 2013. Comparison of Sequencing Utility Programs. Open Bioinforma. J.
ro
7, 1–8. https://doi.org/10.2174/1875036201307010001
-p
Barb, J.J., Oler, A.J., Kim, H.S., Chalmers, N., Wallen, G.R., Cashion, A., Munson, P.J., Ames, N.J., 2016. Development of an Analysis Pipeline Characterizing
re
Multiple Hypervariable Regions of 16S rRNA Using Mock Samples. PLoS One
lP
11, 1–18. https://doi.org/10.1371/journal.pone.0148047
na
BEI Resources, 2014. The following reagent was obtained through BEI Resources, NIAID, NIH as part of the Human Microbiome Project: Genomic DNA from
ur
Microbial Mock Community B (Staggered, Low Concentration), v5.2L, for 16S
Jo
rRNA Gene Sequencing, HM-783D. BEI Resources,10801 Uni. Bokulich, N.A., Kaehler, B.D., Rideout, J.R., Dillon, M., Bolyen, E., Knight, R., Huttley, G.A., Caporaso, J.G., 2018. Optimizing taxonomic classification of marker-gene amplicon sequences with QIIME 2 ’ s q2-feature-classifier plugin. Microbiome 6, 90. https://doi.org/10.1186/s40168-018-0470-z Bokulich, N.A., Subramanian, S., Faith, J.J., Gevers, D., Gordon, J.I., Knight, R., Mills, D.A., Caporaso, J.G., 2013. Quality- filtering vastly improves diversity estimates from Illumina amplicon sequencing. Nat. Methods 10, 57–59.
Journal Pre-proof https://doi.org/10.1038/nmeth.2276 Bolyen, E., Rideout, J.R., Dillon, M.R., Bokulich, N.A., Abnet, C., Al-Ghalith, G.A., Alexander, H., Alm, E.J., Arumugam, M., Asnicar, F., Bai, Y., Bisanz, J.E., Bittinger, K., Brejnrod, A., Brislawn, C.J., Brown, C.T., Callahan, B.J., Caraballo-Rodríguez, A.M., Chase, J., Cope, E., Silva, R. Da, Dorrestein, P.C., Douglas, G.M., Durall, D.M., Duvallet, C., Edwardson, C.F., Ernst, M., Estaki,
of
M., Fouquier, J., Gauglitz, J.M., Gibson, D.L., Gonzalez, A., Gorlick, K., Guo, J.,
ro
Hillmann, B., Holmes, S., Holste, H., Huttenhower, C., Huttley, G., Janssen, S.,
-p
Jarmusch, A.K., Jiang, L., Kaehler, B., Kang, K. Bin, Keefe, C.R., Keim, P.,
re
Kelley, S.T., Knights, D., Koester, I., Kosciolek, T., Kreps, J., Langille, M.G.,
lP
Lee, J., Ley, R., Liu, Y.-X., Loftfield, E., Lozupone, C., Maher, M., Marotz, C., Martin, B.D., McDonald, D., McIver, L.J., Melnik, A. V, Metcalf, J.L., Morgan,
na
S.C., Morton, J., Naimey, A.T., Navas-Molina, J.A., Nothias, L.F., Orchanian, S.B., Pearson, T., Peoples, S.L., Petras, D., Preuss, M.L., Pruesse, E., Rasmussen,
ur
L.B., Rivers, A., Michael S Robeson, I., Rosenthal, P., Segata, N., Shaffer, M.,
Jo
Shiffer, A., Sinha, R., Song, S.J., Spear, J.R., Swafford, A.D., Thompson, L.R., Torres, P.J., Trinh, P., Tripathi, A., Turnbaugh, P.J., Ul-Hasan, S., Hooft, J.J. van der, Vargas, F., Vázquez-Baeza, Y., Vogtmann, E., Hippel, M. von, Walters, W., Wan, Y., Wang, M., Warren, J., Weber, K.C., Williamson, C.H., Willis, A.D., Xu, Z.Z., Zaneveld, J.R., Zhang, Y., Zhu, Q., Knight, R., Caporaso, J.G., 2018. QIIME 2: Reproducible, interactive, scalable, and extensible microbiome data science. https://doi.org/10.7287/peerj.preprints.27295v2 Callahan, B.J., McMurdie, P.J., Rosen, M.J., Han, A.W., Johnson, A.J.A., Holmes,
Journal Pre-proof S.P., 2016. DADA2: High-resolution sample inference from Illumina amplicon data. Nat. Methods 13, 581–583. https://doi.org/10.1038/nmeth.3869 Caporaso, J.G., Kuczynski, J., Stombaugh, J., Bittinger, K., Bushman, F.D., Costello, E.K., Fierer, N., Peña, A.G., Goodrich, J.K., Gordon, J.I., Huttley, G.A., Kelley, S.T., Knights, D., Koenig, J.E., Ley, R.E., Lozupone, C.A., McDonald, D., Muegge, B.D., Pirrung, M., Reeder, J., Sevinsky, J.R., Turnbaugh, P.J., Walters,
of
W.A., Widmann, J., Yatsunenko, T., Zaneveld, J., Knight, R., 2010. QIIME
-p
335–6. https://doi.org/10.1038/nmeth.f.303
ro
allows analysis of high-throughput community sequencing data. Nat. Methods 7,
re
Caporaso, J.G., Lauber, C.L., Walters, W.A., Berg-Lyons, D., Huntley, J., Fierer, N.,
lP
Owens, S.M., Betley, J., Fraser, L., Bauer, M., Gormley, N., Gilbert, J.A., Smith, G., Knight, R., 2012. Ultra-high-throughput microbial community analysis on the
na
Illumina HiSeq and MiSeq platforms. ISME J. 6, 1621–1624.
ur
https://doi.org/10.1038/ismej.2012.8
Jo
Carlos, N., Tang, Y.-W., Pei, Z., 2012. Pearls and pitfalls of genomics-based microbiome analysis. Emerg. Microbes Infect. 1, 1–3. https://doi.org/10.1038/emi.2012.41 Claesson, M.J., Clooney, A.G., O’Toole, P.W., 2017. A clinician’s guide to microbiome analysis. Nat. Rev. Gastroenterol. Hepatol. 14, 585–595. https://doi.org/10.1038/nrgastro.2017.97 Claesson, M.J., Wang, Q., O’Sullivan, O., Greene-Diniz, R., Cole, J.R., Ross, R.P., O’Toole, P.W., 2010. Comparison of two next-generation sequencing
Journal Pre-proof technologies for resolving highly complex microbiota composition using tandem variable 16S rRNA gene regions. Nucleic Acids Res. 38, e200–e200. https://doi.org/10.1093/nar/gkq873 D’Argenio, V., Casaburi, G., Precone, V., Salvatore, F., 2014. Comparative metagenomic analysis of human gut microbiome composition using two different bioinformatic pipelines. Biomed Res. Int. 2014, 325340.
ro
of
https://doi.org/10.1155/2014/325340
Edgar, R.C., 2017. Accuracy of microbial community diversity estimated by closed-
-p
and open-reference OTUs. PeerJ 5, e3889. https://doi.org/10.7717/peerj.3889
re
Edgar, R.C., 2010. Search and clustering orders of magnitude faster than BLAST.
lP
Bioinformatics 26, 2460–2461. https://doi.org/10.1093/bioinformatics/btq461
na
Edgar, R.C., Haas, B.J., Clemente, J.C., Quince, C., Knight, R., 2011. UCHIME improves sensitivity and speed of chimera detection. Bioinformatics 27, 2194–
ur
2200. https://doi.org/10.1093/bioinformatics/btr381
Jo
Escobar-Zepeda, A., Godoy-Lozano, E.E., Raggi, L., Segovia, L., Merino, E., Gutiérrez-Rios, R.M., Juarez, K., Licea-Navarro, A.F., Pardo-Lopez, L., SanchezFlores, A., 2018. Analysis of sequencing strategies and tools for taxonomic annotation: Defining standards for progressive metagenomics. Sci. Rep. 8, 12034. https://doi.org/10.1038/s41598-018-30515-5 Escobar-Zepeda, A., Vera-Ponce de León, A., Sanchez-Flores, A., 2015. The Road to Metagenomics: From Microbiology to DNA Sequencing Technologies and Bioinformatics. Front. Genet. 6, 348. https://doi.org/10.3389/fgene.2015.00348
Journal Pre-proof Janda, J.M., Abbott, S.L., 2007. 16S rRNA gene sequencing for bacterial identification in the diagnostic laboratory: Pluses, perils, and pitfalls. J. Clin. Microbiol. https://doi.org/10.1128/JCM.01228-07 Johnson, J.S., Spakowicz, D.J., Hong, B.Y., Petersen, L.M., Demkowicz, P., Chen, L., Leopold, S.R., Hanson, B.M., Agresta, H.O., Gerstein, M., Sodergren, E., Weinstock, G.M., 2019. Evaluation of 16S rRNA gene sequencing for species
ro
https://doi.org/10.1038/s41467-019-13036-1
of
and strain-level microbiome analysis. Nat. Commun. 10.
re
https://doi.org/10.3390/d2020207
-p
Jost, L., 2010. The relation between evenness and diversity. Diversity 2, 207–232.
lP
Jünemann, S., Sedlazeck, F.J., Prior, K., Albersmeier, A., John, U., Kalinowski, J.,
na
Mellmann, A., Goesmann, A., Von Haeseler, A., Stoye, J., Harmsen, D., 2013. Updating benchtop sequencing performance comparison. Nat. Biotechnol.
ur
https://doi.org/10.1038/nbt.2522
Jo
Kaehler, B., Bokulich, N., McDonald, D., Knight, R., Caporaso, G., Huttley, G., 2018. Species abundance information improves sequence taxonomy classification accuracy 406611. https://doi.org/10.1101/406611 Kim, B.R., Shin, J., Guevarra, R.B., Lee, J.H., Kim, D.W., Seol, K.H., Lee, J.H., Kim, H.B., Isaacson, R.E., 2017. Deciphering diversity indices for a better understanding of microbial communities. J. Microbiol. Biotechnol. 27, 2089– 2093. https://doi.org/10.4014/jmb.1709.09027 Klindworth, A., Pruesse, E., Schweer, T., Peplies, J., Quast, C., Horn, M., Glöckner,
Journal Pre-proof F.O., 2013. Evaluation of general 16S ribosomal RNA gene PCR primers for classical and next-generation sequencing-based diversity studies. Nucleic Acids Res. 41, e1–e1. https://doi.org/10.1093/nar/gks808 Kullback, S., Leibler, R.A., 1951. On Information and Sufficiency. Ann. Stat. 22, 79– 86. https://doi.org/10.1214/aoms/1177729694
of
Loman, N.J., Misra, R. V., Dallman, T.J., Constantinidou, C., Gharbia, S.E., Wain, J.,
ro
Pallen, M.J., 2012. Performance comparison of benchtop high-throughput sequencing platforms. Nat. Biotechnol. 30, 434–439.
-p
https://doi.org/10.1038/nbt.2198
re
Nguyen, N.-P., Warnow, T., Pop, M., White, B., 2016. A perspective on 16S rRNA
lP
operational taxonomic unit clustering using sequence similarity. npj Biofilms
na
Microbiomes 2, 16004. https://doi.org/10.1038/npjbiofilms.2016.4 Nilakanta, H., Drews, K.L., Firrell, S., Foulkes, M.A., Jablonski, K.A., 2014. A review
ur
of software for analyzing molecular Sequences. BMC Res. Notes.
Jo
https://doi.org/10.1186/1756-0500-7-830 Noecker, C., McNally, C.P., Eng, A., Borenstein, E., 2017. High-resolution characterization of the human microbiome. Transl. Res. 179, 7–23. https://doi.org/10.1016/j.trsl.2016.07.012 Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, É., 2011. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 12, 2825–2830.
Journal Pre-proof Plummer, E., Twin, J., Bulach, D.M., Garland, S.M., Tabrizi, S.N., 2015. A Comparison of Three Bioinformatics Pipelines for the Analysis of Preterm Gut Microbiota using 16S rRNA Gene Sequencing Data. J Proteomics Bioinform Cit. 8, 283–291. https://doi.org/10.4172/jpb.1000381 Pylro, V.S., Morais, D.K., de Oliveira, F.S., dos Santos, F.G., Lemos, L.N., Oliveira, G., Roesch, L.F.W., 2016. BMPOS: a Flexible and User-Friendly Tool Sets for
of
Microbiome Studies. Microb. Ecol. 72, 443–447. https://doi.org/10.1007/s00248-
ro
016-0785-x
-p
Pylro, V.S., Roesch, L.F.W., Morais, D.K., Clark, I.M., Hirsch, P.R., Tótola, M.R.,
re
2014. Data analysis for 16S microbial profiling from different benchtop
lP
sequencing platforms. J. Microbiol. Methods 107, 30–37.
na
https://doi.org/10.1016/j.mimet.2014.08.018 Quail, M.A., Smith, M., Coupland, P., Otto, T.D., Harris, S.R., Connor, T.R., Bertoni,
ur
A., Swerdlow, H.P., Gu, Y., 2012. A tale of three next generation
Jo
sequencingplatforms: comparison of Ion Torrent, PacificBiosciences and Illumina MiSeq sequencers. BMC Genomics 13, 1–13. https://doi.org/10.1186/1471-216413-341
Rideout, J.R., Dillon, M.R., Bokulich, N.A., Abnet, C., Ghalith, G.A. Al, Alexander, H., Alm, E.J., Arumugam, M., Bai, Y., Bisanz, J.E., Bittinger, K., Brejnrod, A., Colin, J., Brown, C.T., Callahan, B.J., Mauricio, A., Rodríguez, C., Chase, J., Cope, E., Silva, R. Da, Dorrestein, P.C., Douglas, G.M., Duvallet, C., Edwardson, C.F., Ernst, M., Fouquier, J., Gauglitz, J.M., Gibson, D.L., Gonzalez,
Journal Pre-proof A., Huttley, G.A., Janssen, S., Jarmusch, A.K., Kaehler, B.D., Kang, K. Bin, Keefe, C.R., Keim, P., Kelley, S.T., Ley, R., Loftfield, E., Marotz, C., Martin, B., Mcdonald, D., Mciver, L.J., Alexey, V., Metcalf, J.L., Morgan, S.C., Morton, J.T., Naimey, A.T., 2018. QIIME 2 : Reproducible , interactive , scalable , and extensible microbiome data science. https://doi.org/10.7287/peerj.preprints.27295 Rognes, T., Flouri, T., Nichols, B., Quince, C., Mahé, F., 2016. VSEARCH: a
of
versatile open source tool for metagenomics. PeerJ 4, e2584.
ro
https://doi.org/10.7717/peerj.2584
-p
Srinivasan, R., Karaoz, U., Volegova, M., MacKichan, J., Kato-Maeda, M., Miller, S.,
re
Nadarajan, R., Brodie, E.L., Lynch, S. V., 2015. Use of 16S rRNA Gene for
lP
Identification of a Broad Range of Clinically Relevant Bacterial Pathogens. PLoS One 10, e0117617. https://doi.org/10.1371/journal.pone.0117617
na
Větrovský, T., Baldrian, P., 2013. The Variability of the 16S rRNA Gene in Bacterial
ur
Genomes and Its Consequences for Bacterial Community Analyses. PLoS One 8,
Jo
e57923. https://doi.org/10.1371/journal.pone.0057923 Yarza, P., Yilmaz, P., Pruesse, E., Glöckner, F.O., Ludwig, W., Schleifer, K.-H., Whitman, W.B., Euzéby, J., Amann, R., Rosselló-Móra, R., 2014. Uniting the classification of cultured and uncultured bacteria and archaea using 16S rRNA gene sequences. Nat. Rev. Microbiol. 12, 635–645. https://doi.org/10.1038/nrmicro3330 Zhang, J., Chiodini, R., Badr, A., Zhang, G., 2011. The impact of next-generation sequencing on genomics. J. Genet. Genomics 38, 95–109.
Journal Pre-proof
Jo
ur
na
lP
re
-p
ro
of
https://doi.org/10.1016/j.jgg.2011.02.003
Journal Pre-proof TABLES Table 1. Data processing pipelines QIIME v1.9.1 default parameters
VSEARCH v2.3.4
QIIME2 v2018.2
Assembled of pair end reads (MiSeq): multiple_join_paired_ends PGM: single ends read split_libraries_fastq Q≥20 - split_libraries_fa DADA2 denoising (pstq (default for max-ee 1.0): PGM and - qiime dada2 turning off the denoisefilters for paired Illumina (r=0; (Illumina) q=0; n=100)) - Quality filtering: - qiime dada2 denoiseVSEARCH single (fastq_maxee (PGM) 1.0) Discard
ro
of
DAT A PRET REAT MENT
QIIME v1.9.1 modified parameters
T AXONOMIC ANALYSIS (Refence database)
DIVERSIT Y ANALYSIS
OT U picking: de novo (included in DADA2)
No
Included in DADA2
Assign taxonomy using UCLUST method on QIIME (assign_taxonomy. py)
qiime featureclassifier classify-sklearn
re
lP
No
OT U picking: de novo OUT clustering: UPARSE method (VSEARCH) (-id 0.97)
na
CHIMERA REMOVAL
OT U picking: de novo (pick_otus.py) OT U clustering: UCLUST (default) (pick_rep_set.py) (align_seqs.py) (default: PyNast) CHIMERA SLAYER MET HOD (identify_chimeric_seq s.py) and (filter_alignment.py)
ur
(OT U picking)
OT U picking: open reference (pick_open_reference _otus)
Jo
CLUST ERING MET HOD
-p
singletons (VSEARCH) (minsize=2)
(assign_taxonomy.py) with UCLUST method and Greengenes13_8 97% as reference database
Greengenes 97% with UCLUST method
Qiime diversity alpha-rarefaction (QIIME 2_2018.8)
Greengenes13_ 8 97%
Journal Pre-proof Table 2. Expected and detected taxa at staggered mock community at family, genus and species level. A true positive is a taxon that was both observed and expected; A false positive is a taxon that was observed but not expected; A false negative is a taxon that was expected but not observed. FAMILY expected
detected
TP
FP
PGM
ILL
PGM
37
24
17
16
34
36
17
17
VSEARCH
26
26
17
QIIME 2
20
19
QIIME 1 MOD
PGM
17
PGM
8
0
1
17
19
0
0
9
9
0
0
5
4
1
1
-p
17 16
re
16
ILL
20
ro
QIIME 1
ILL
of
ILL
MO CK
FN
GENUS
detected
lP
expected
QIIME 1 QIIME 1 MOD
Jo
QIIME 2
expected
FN
PGM
ILL
PGM
ILL
PGM
ILL
PGM
60
39
17
16
43
23
0
1
52
52
16
16
36
36
1
1
39
38
16
16
23
22
1
1
23
26
15
15
8
11
2
2
ur
VSEARCH
17
FP
ILL
na
MO CK
TP
SPECIES detected
TP
FP
FN
ILL
PGM
ILL
PGM
ILL
PGM
ILL
PGM
77
52
8
8
69
44
12
12
63
66
6
7
57
59
14
13
VSEARCH
51
52
6
7
45
45
14
13
QIIME 2
27
34
4
5
23
29
16
15
PIPELINE
MO CK
QIIME 1 QIIME 1 MOD
20
ILL: Illumina M iSeq; PGM : Ion Personal Genome M achine. FP: false positive; FN: false negative, TP: true positive
Journal Pre-proof Table 3. Kullback-Leiber (KL) divergence and Spearman correlation coefficients for known mock composition and obtained mock composition after sequencing and analysis.
**
P < 0.0001.
Family
Genus
Species
Family
Genus
Species
ILL
0.062
2.247
8.219
0.722**
0.464**
-0.031
PGM
0.115
1.483
6.273
0.855**
0.606**
QIIME1 MOD
ILL
0.132
4.418
11.290
of
0.110
0.750**
0.519**
-0.470
PGM
0.183
4.589
8.952
0.808**
0.525**
0.009
VSEARCH
ILL
0.025
3.846
ro
QIIME1
10.219
0.852**
0.581**
0.004
PGM
0.225
4.131
9.184
0.768**
0.465**
0.055
ILL
0.473
4.020
15.997
0.798**
0.520**
0.024
PGM
0.805
4.423
12.527
0.550**
0.049
0.033
QIIME 2
-p
Platform
lP
Pipeline
Spearman correlation coefficient
re
Kullback-Leiber Divergence (DKL)
Jo
ur
na
ILL= Illumina M iSeq Platform; PGM : Ion Personal Genome M achine
Journal Pre-proof Table 4. Expected and detected taxa at staggered mock community at species level for SPINGO classifier. A true positive is a taxon that was both observed and expected; A false positive is a taxon that was observed but not expected; A false negative is a taxon that was expected but not observed. S pecies expected ILL
20
S pecies detected 95
True positive species 20
False positive species 75
False negative species 0
% Unassigned 56.1
ro
of
PGM 33 18 15 2 71.5 ILL: Illumina M iSeq; PGM : Ion Personal Genome M achine. FP: false positive; FN: false negative, TP: true positive
Jo
ur
na
lP
re
-p
.
Journal Pre-proof Table 5. Results of pipelines comparisons showing the “pros,” “cons,” and general indications for non-professional users.
Q IIME1
Q IIME1 MOD
Q IIME 2
Database Pros
SPING O species clasif f ier
Greengenes Best TDR, detection of E. coli,
Low divergence (D KL ) and high correlation for P GM
Low divergence and high correlation for ILL
High correlation for ILL at family level
High number of true positive species (detection of 90-100% species) Lowest number of false negative
Non-unassigned OTUs for ILL
of
Detection of species at low operon levels
Lowest TAR
4-5 species detected
7-8 species detected
P GM missed 7 expected genera: (Listeria, Escherichia, Pseudomonas, Enterococcus, Deinococcus, Actinomyces and Acinetobacter)
Heatmap: close similarity with between platforms
6-7 species detected
High percentage of unassigned data and false positive species
Best pipeline for ILL
Good pipeline at species level for ILL when dealing with taxonomic analysis
-p
Bacteroides and Bacteriodaceae not detected by P GM.
Based on RDP
Best TAR
Low divergence (D KL ) and high correlation for both ILL and P GM
Cons
VSEARCH
ro
Pipeline
re
8 species detected
lP
ILL missed Listeria and Escherichia Lowest TDR
Good pipeline for ILL Select this pipeline for P GM except if the target is the gut microbiota (Bacteroides is not detected)
P ipeline for P GM when identification of Bacteroides is needed
ur
Best pipeline for P GM
Jo
Indications
correlation
na
Non-significant P GM
Best accuracy (TAR) Not useful for P GM Not able to detect Listeria and Escherichia This pipeline only for Illumina when you look for high accuracy rate
for
Journal Pre-proof Figure legends Figure 1. Family level bacterial relative abundance for known and observed staggered mock community analyzed with two platforms (Illumina and PGM) and four processing pipelines (QIIME default, QIIME modified, Vsearch and QIIME 2) separated according to operon concentrations. Unassigned: non-identified features; Under-classified: features expected identified only in upper taxa; Misclassified:
ro
of
observed features that were not expected.
Figure 2. Genus level bacterial relative abundance for known and observed staggered
-p
mock community analyzed with two platforms (Illumina and PGM) and four
re
processing pipelines (QIIME default, QIIME modified, Vsearch and QIIME 2)
lP
separated according to operon concentrations. Unassigned: non-identified features; Under-classified: features expected identified only in upper taxa; Misclassified:
na
observed features that were not expected.
ur
Figure 3. Incidence map showing bacteria found at the family, genus and species
Jo
levels (species (X), genus (green), family (blue), non-detected (pink)). Figure 4. Heatmap of relative abundance at family, genus and species level detected by four bioinformatics strategies (QIIME 1 default, QIIME 1 modified, QIIME 2 and Vsearch) and two sequencing platforms (ILL (Illumina MiSeq) and PGM) compared to the known relative abundance (HM-783D) (a, b and c). Figura 5. Heatmap of relative abundance at species level for SPINGO species classifier. Only the 20 expected taxa were included.
Journal Pre-proof Figure 6. (A) Taxon detection rate (TDR) and (B) taxon accuracy rate (TAR) at the family, genus and species levels. TDR was calculated as the number of true positive features divided by the total number of expected features. TAR is the number of true positive features divided by the total number of observed features. ILL = Illumina MiSeq platform; PGM = Ion Personal Genome Machine. Figure 7. Alpha diversity for different platforms-strategies. Alpha diversity indices for
-p
SUPPLEMENTARY MATERIAL
ro
evenness (c), and Shannon diversity index (d).
of
a range of rarified sequence depths: observed OTUs (a), Chao1 index (b), Pielou’s
re
Figure 1s. Family level bacterial relative abundance for known and observed
lP
staggered mock samples with two platforms (Illumina and PGM) and four processing
na
strategies (QIIME default, QIIME modified, Vsearch and QIIME 2). Unassigned: nonidentified features; Under-classified: features expected identified only in upper taxa;
ur
Misclassified: observed features that were not expected.
Jo
Figure 2s. Genus level bacterial relative abundance for known and observed staggered mock community analyzed with two platforms (Illumina and PGM) and four processing pipelines (QIIME default, QIIME modified, Vsearch and QIIME 2). Unassigned: non-identified
features; Under-classified: features expected
identified
only in upper taxa; Misclassified: observed features that were not expected. Figure 3s. Species level bacterial relative abundance for known and observed staggered mock community analyzed with two platforms (Illumina and PGM) by SPINGO
Journal Pre-proof FIGURES Figure 1 a)
b) 0.0400 0.0350 0.0300 0.0250 0.0200 0.0150 0.0100 0.0050 0.0000
Stap hyloco ccaceae
QIIME1
QIIME2
Clostrid ia ceae
of
Streptococcaceae
PGM
ILL
c)
d)
0.0035
MOCK
Pseudomonadacea e
PGM
ro
MOCK
VSEARCH
QIIME1 MOD
QIIME1
QIIME2
ILL
HM783D
VSEARCH
QIIME1 MOD
QIIME1
0.0000
QIIME2
0.1000
Bacillaceae
VSEARCH
Rho dobacteraceae
QIIME1 MOD
0.2000
QIIME1
Enteroba cteriaceae
0.3000
HM783D
0.4000
QIIME2
0.5000
VSEARCH
0.6000
QIIME1 MOD
0.7000
-p
0.0140
0.0030
0.0120
0.0025
0.0100
0.0020
Moraxellaceae
0.0080
Actin omyceta ceae
0.0010
Bacteroidaceae
Helicobacteraceae
0.0060
re
0.0015
PGM
Jo
ur
e)
Listeriaceae
MOCK
QIIME2_PGM
QIIME2
VSEARCH
QIIME1 MOD
QIIME1
QIIME2
VSEARCH ILL
VSEARCH_PG QIIME1 M MOD_PGM
HM783D
0.0000
QIIME1 MOD
QIIME2
VSEARCH
QIIME1 MOD
MOCK
Lacto bacillaceae
0.0020
Enterococca ceae
na
ILL
QIIME1
HM783D
QIIME2
VSEARCH
QIIME1 MOD
QIIME1
0.0000
lP
Deinococcaceae
QIIME1
0.0005
0.0040
Neisseria ceae Propionibacteriaceae
PGM
QIIME1 VSEARCH_IL L MOD_ILL
QIIME2_ILL
QIIME1_ILL
QIIME1_PGM MOCK HM783D
QIIME1_ILL
QIIME2_PG M
QIIME1-ILL QIIME2_ILL
QIIME1 MOD_ILL VSEARCH_P QIIME1 GM QIIME1_PG MOD_PGM M
Unassigned
VSEARCH_IL MOCK L HM783D QIIME2_ILL
QIIME1 MOD_ILL
QIIME2_PG M
VSEARCH_ILL
Missclassified
Underclassified
QIIME1_PG MOCK MHM783D QIIME1 MOD_PGM VSEARCH_P GM
Journal Pre-proof Figure 2 a)
b)
0.3500
0.0350
0.3000
0.0300
0.2500
0.0250
0.2000
0.0200
0.1500
Streptococcus
0.1000
Stap hyloco ccus
0.0500
0.0150
Pseudomonas
0.0100
Clostrid iu m
0.0050
Rho dobacter
0.0000
0.0030 0.0100 Propionibacterium
0.0020
Neisseria
0.0015
0.0060
0.0020
ILL
na
ur Jo QIIME1_PG M
VSEARCH_I LL MOCK HM783D
QIIME2_ILL
QIIME2
VSEARCH
QIIME1 MOD
QIIME1
HM783D
Actin omyces
QIIME2
VSEARCH
QIIME1 MOD
QIIME1
QIIME2
HM783D MOCK
PGM
Underclassified QIIME1-ILL QIIME2_PG M
QIIME1-ILL
QIIME2_PG M
QIIME1 MOD_ILL
QIIME1 MOD_ILL
VSEARCH_P GMQIIME1 MOD_PGM
VSEARCH
Acineto bacter
QIIME1
re
Helicobacter
Missclassified
QIIME1-ILL
QIIME2_PG M
Bacteroides
0.0000
lP
Unassigned
QIIME2
VSEARCH
QIIME1 MOD
QIIME1
HM783D
QIIME2
VSEARCH
QIIME1
QIIME1 MOD
e)
Deinococcus
0.0005
Lacto bacillus
0.0000
Enterococcu s
0.0010
Listeria
PGM
VSEARCH
-p
0.0040
PGM
ro
0.0025
0.0080
MOCK
of
d)
0.0120
MOCK
QIIME1 MOD ILL
PGM
c)
ILL
QIIME2
Bacillus QIIME1
QIIME2
VSEARCH
QIIME1 MOD
MOCK
Escherichia
QIIME1 MOD
ILL
QIIME1
HM783D
QIIME2
VSEARCH
QIIME1 MOD
QIIME1
0.0000
QIIME1 MOD_ILL
VSEARCH_I LL
VSEARCH_ PGM
QIIME2_ILL
QIIME2_ILL
VSEARCH_ PGM QIIME1 MOD_PGM
QIIME1 MOD_PGM VSEARCH_I LL
QIIME1_PG M
MOCK HM783D
QIIME1_PG M
MOCK HM783D
Journal Pre-proof
Q IIME 1 MO D
Q IIME 1 Operons
Mock species
PGM
ILL
PGM
ILL
VSEARCH PGM ILL
Q IIME 2 PGM
ILL
Actinomyces odontolyticus Bacteroides vulgatus 1,000
Deinococcus radiodurans
x
Species
Enterococcus faecalis
Genus
Streptococcus pneumoniae
Family only
x
x
x
x
Propionibacterium acnes
x
x
x
x
x
x
x
x
Bacillus cereus
x
x
x
x
x
x
x
Staphylococcus aureus
x
x
x
-p
Non- detected x
x
x
x
Streptococcus agalactiae
x
x
x
x
x
Escherichia/Shigella coli
x
x
Rhodobacter sphaeroides
x
lP
Acinetobacter baumannii Helicobacter pylori
x
x
x
of
Lactobacillus gasseri
10,000
Listeria monocytogenes
Clostridium beijerinckii 100,000
1,000,000
Staphylococcus epidermidis x
ur Jo
x
x
x
x
x
x
x
x
x
x
x
x
na
Streptococcus mutans
Figure 3
re
Pseudomonas aeruginosa
ro
Neisseria meningitidis
x
Journal Pre-proof Figure 4 a)
ro
of
QIIME2 ILL QIIME1 MOD PGM VSEARCH PGM QIIME2 PGM QIIME1 PGM HM783D QIIME1 MOD ILL QIIME1 ILL VSEARCH ILL
HM783D QIIME1 MOD PGM VSEARCH PGM QIIME2 ILL QIIME2 PGM QIIME1 PGM QIIME1 MOD ILL QIIME1 ILL VSEARCH ILL
c)
Jo
ur
na
lP
re
-p
b)
HM783D VSEARCH PGM QIIME1 ILL QIIME1 MOD ILL QIIME2 ILL QIIME1 MOD PGM QIIME2 PGM QIIME1 PGM VSEARCH ILL
Journal Pre-proof Figure 5
Jo
ur
na
lP
re
-p
ro
of
HM783D SPINGO PGM SPINGO ILL
0.00
FAMILY GENUS
ILL
PGM SPECIES
QIIME 2
VSEARCH
QIIME 1 MOD
QIIME 1
ILL
of QIIME 2
VSEARCH
QIIME 1 MOD
QIIME 1
QIIME 2
ro
VSEARCH
QIIME 1 MOD
QIIME 1
QIIME 2
VSEARCH
QIIME 1 MOD
QIIME 1
0.00
QIIME 2
GENUS
-p
re
FAMILY
VSEARCH
b)
QIIME 1 MOD
TAR
lP
0.80
QIIME 1
na
0.60
QIIME 2
VSEARCH
QIIME 1 MOD
0.40
ur
0.20
Jo
QIIME 1
Journal Pre-proof
Figure 6
a)
TDR
1.00
0.80
0.60
0.40
0.20
SPECIES
PGM
Journal Pre-proof Figure 7
of
a)
ur Jo
c)
na
lP
re
-p
ro
b)
Journal Pre-proof
Jo
ur
na
lP
re
-p
ro
of
d)
Journal Pre-proof SUPPLEMENTARY MATERIAL Figure 1s 100%
Actinomycetaceae Propionibacteriaceae
90%
Bacteroidaceae Bacillaceae
80%
Listeriaceae Staphylococcaceae
70%
of
Enterococcaceae
Lactobacillaceae Streptococcaceae
ro
60%
50%
Rhodobacteraceae
-p
40%
Clostridiaceae
Helicobacteraceae
re
30%
Neisseriaceae
Moraxellaceae
lP
20%
Enterobacteriaceae
0%
QIIME1 MO D VSEARCH
QIIME2
ur
ILL
Jo
QIIME1
na
10%
Pseudomonadaceae Deinococcaceae Unassigned Underclassified
HM783D MOCK
QIIME1
QIIME1 MO D VSEARCH PGM
QIIME2
Missclassified
Journal Pre-proof Figure 2s
100% 90% 80% 70%
60%
of
50% 40%
ro
30%
-p
20% 10%
re
0%
lP
QIIME1 QIIME1 VSEARCH QIIME2 HM783D QIIME1 QIIME1 VSEARCH QIIME2 MOD MOD
MOCK
Jo
ur
na
ILL
PGM
Acinetobacter Actinomyces Bacillus Bacteroides Clostridium Deinococcus Enterococcus Escherichia Helicobacter Lactobacillus Listeria Neisseria Propionibacterium Pseudomonas Rhodobacter Staphylococcus Streptococcus Underclassified Missclassified Unassigned
Journal Pre-proof Figure 3s 0.25
0.03 0.025
0.2
0.02
0.15 0.015 0.1
0.01
0.05
0.005
0 E scherichia c ol i
Rhodobac te r sphae roi de s
SPINGO_ PGM
Staphy lococcus e pi de rmidis
SPINGO_ILL
Bacill us cere us
HM-783D
SPINGO_ PGM
Pseudomonas aeruginosa
0.0012
0.005
0.001
0.004
0.0008
0.003
0.0006
0.002
0.0004
0.001
0.0002
Staphy lococcus aureus
SPINGO_ILL
Stre ptococcus agalactiae
HM-783D
He lic obacter pyl ori
Lis te ria monoc ytoge nes
Ne isse ria me ningitidis
Propionibac te rium acnes
SPINGO_ILL
HM-783D
Jo
ur
na
lP
SPINGO_ PGM
Lactobaci llus gas seri
0
re
Aci ne tobacter baumannii
-p
ro
0.006
0
Clostri di um beij erinck ii
Stre ptococcus mutans
of
0
Actinomyc es odontol yticus
Bacteroides vulgatus
SPINGO_ PGM
De inococ cus radi odurans
E nte rococc us fae calis
SPINGO_ILL
Stre ptococcus pne umoni ae
HM-783D
Journal Pre-proof Highlights
ur
na
lP
re
-p
ro
of
16S rRNA short-read sequence data gene presents limitations at species level. For metagenomic analysis up to genus level, most classifiers perform well. SPINGO classifier is useful to address the limitations at species level. For PGM data, the optimal pipeline was QIIME 1.9.1 with default parameters. All the protocols implemented in QIIME performed well for Illumina data.
Jo