A novel method to determine the minimum number of sequences required for reliable microbial community analysis

A novel method to determine the minimum number of sequences required for reliable microbial community analysis

Accepted Manuscript A novel method to determine the minimum number of sequences required for reliable microbial community analysis Jiajia Ni, Xiaojin...

851KB Sizes 3 Downloads 99 Views

Accepted Manuscript A novel method to determine the minimum number of sequences required for reliable microbial community analysis

Jiajia Ni, Xiaojing Li, Zhili He, Meiying Xu PII: DOI: Reference:

S0167-7012(17)30156-2 doi: 10.1016/j.mimet.2017.06.006 MIMET 5182

To appear in:

Journal of Microbiological Methods

Received date: Revised date: Accepted date:

9 February 2017 6 June 2017 6 June 2017

Please cite this article as: Jiajia Ni, Xiaojing Li, Zhili He, Meiying Xu , A novel method to determine the minimum number of sequences required for reliable microbial community analysis, Journal of Microbiological Methods (2017), doi: 10.1016/j.mimet.2017.06.006

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

ACCEPTED MANUSCRIPT REVISED A Novel Method to Determine the Minimum Number of Sequences Required for Reliable Microbial Community Analysis Running title: Sequence number in microbiota analysis

PT

Jiajia Ni1,2,3,4, Xiaojing Li1,2,Zhili He5, Meiying Xu1,2* 1

MA

NU

SC

RI

Guangdong Provincial Key Laboratory of Microbial Culture Collection and Application, Guangdong Institute of Microbiology, Guangzhou, China 2 State Key Laboratory of Applied Microbiology Southern China, Guangzhou, China 3 Department of Hepatobiliary Surgery II, Guangdong Provincial Research Center of Artificial Organ and Tissue Engineering, Zhujiang Hosplital of Southern Medical University, Guanzhou, China 4 State Key Laboratory of Organ Failure Research, Southern Medical University, Guangzhou, China 5 Institute for Environmental Genomics and Department of Microbiology and Plant Biology, University of Oklahoma, Norman, OK, USA

AC

CE

PT E

Number of words: 4795 Number of figures: 5

D

* Correspondence: Meiying Xu, Guangdong Institute of Microbiology, Guangzhou, 510070, China. E-mail: [email protected]

1

ACCEPTED MANUSCRIPT

SC

RI

PT

Abstract Although high-throughput sequencing is an efficient approach to study the structure of microbial communities in detail, it is still impossible to enumerate all individuals using this method. Therefore, it is a common strategy to generate sampling datasets that are representative of the assemblages. However, the representativeness of these sampling datasets has never been assessed. In this study, we developed a method to determine the minimum number sequences that are required to be analyzed to obtain a reliable description of microbial community structure. First, a set of datasets from microbial communities were constructed by in silico sampling at different depths. Second, the correlation equation between dissimilarity of the sampling datasets and sampling depths was fitted, and thereby the minimum number of 16S rRNA gene sequences was predicted. Finally, we verified the method using empirical data of microbiota from a laboratory mesocosm. Our method showed that at least 5,528,079 sequences were required to reliably characterize microbial communities inhabiting the mesocosms. However, if only dominant OTUs (>1%) were considered, thousands of sequences were enough. This promising method provides a criterion to ensure sequencing sufficiency when analyzing the structure of natural microbial communities.

NU

Keywords: High-throughput sequencing; microbial community; minimum number of sequences; sequencing depth; community structure

AC

CE

PT E

D

MA

1 Introduction Microorganisms inhabit almost every environment in the biosphere and play crucial roles in biogeochemical cycling of essential elements (Horner-Devine et al., 2004; Huse et al., 2008; Zhou et al., 2015; Metcalf et al., 2016). However, adequate characterization and quantification of the structure of microbial communities remains a challenge for microbiologists (Bartram et al., 2011; Zhou et al., 2011) because of the following three reasons: Firstly, a massive amount of microbial cells typically inhabit natural environments. It is estimated that 1 gram of forest soil contains 4×107 prokaryotic cells and 1 gram of grassland soil contains 2×109 prokaryotic cells (Torsvik et al., 2002; Daniel, 2005). Secondly, microbial communities are extremely diverse in natural environments. Natural microbial communities are typically composed of a few dominant species followed by a large number of rare taxa (Hua et al., 2015; Zhou et al., 2015). Estimates of soil microbial diversity range from thousands to a million microbial “species” in a few grams of soil (Gans et al., 2005), and there is a huge difference in their richness (Curtis et al., 2002). Thirdly, the majority of microorganisms have not yet been cultivated (Torsvik et al., 2002; Zhou et al., 2015). It is believed that less than 1% of all bacterial species have been described by traditional detection techniques (Horner-Devine et al., 2004). Biomarker-based techniques, such as DNA analyses, enable the detection of not-yet cultivated species, and 16S rRNA gene is widely used as a taxonomic/phylogenetic biomarker to characterize the composition and structure of microbial communities. High-throughput sequencing technologies, such as 454 pyrosequencing and Illumina MiSeq sequencing, can sequence millions of 16S rRNA gene sequences at a reasonable cost, thus providing an efficient tool for studying the structure of microbial communities in detail (Youssef et al., 2009; Caporaso et al., 2012; Tremblay et al., 2015). The structures of various microbial communities inhabiting different environments have been described and quantified by high-throughput sequencing of 2

ACCEPTED MANUSCRIPT

RI

PT

16S rRNA gene amplicons (Roesch et al., 2007; Caporaso et al., 2011; Møller et al., 2013; Li et al., 2014) with sequencing depths ranging from thousands to several millions per sample (Harris et al., 2010; Campbell et al., 2011; Zhou et al., 2011; Caporaso et al., 2011; Mou et al., 2013). However, it is still unclear whether these sampling datasets adequately represent the composition of natural microbial communities; this is a central issue in comparative studies. The quantitative understanding of biogeochemical cycles requires information about the number of species and the abundance of each species (Martín and Goldenfeld, 2006). Although Caporaso et al. (2011) showed that only 2,000 Illumina single-end reads were sufficient to recapture the same relationships among samples, insufficient sampling is one of the major reasons causing a lower reproducibility and overestimation of beta-diversity of microbial communities (Zhou et al., 2011; Zhou et al., 2013). Increasing sequencing depth indeed helped ameliorate those problems (Zhou et al., 2011; Lundin et al., 2012; Ge et al., 2014).

NU

SC

If we can reliably assess the representativeness of a sampling dataset and predict the minimum size of the dataset that can allow reliable community analysis, then researchers can make better decisions and allocate resources more appropriately for studying microbial ecology. In particular, to determine whether the continuously decreasing cost of sequencing should be utilized to obtain deeper sample coverage or to increase the number of samples that are sequenced (Caporaso et al., 2012).

AC

CE

PT E

D

MA

Various methods are used to assess the sufficiency of sequencing depth in microbial community structure analyses (Hughes et al., 2001; Bik et al., 2010; Zhou et al., 2013). Estimation of species richness is often conducted to estimate the number of species in a community and the necessary sequencing depth (Walther and Morand, 1998; Gotelli and Colwell, 2001; Brose et al., 2003). Species accumulation curve (or collector’s curve) is a typical method used to estimate species richness in a community; it is also used to estimate the expected number of new species that can be detected given a level of additional sampling effort (Mao et al., 2005; Dove and Cribb, 2006; Deng et al., 2015). The most often used species accumulation curves are the exponential equation and the Michaelis-Menten model (Brose et al., 2003). Non-parametric estimators can also be employed to estimate the species richness. These methods only require the number of samples in which each species is found rather than any parametric information about their abundances. Although these methods are widely used to estimate species richness in macroecology (Melo and Froehlich, 2001; Hortal et al., 2006), they are hardly used for microbial communities as the sample data do not contain information about the number of rare species at the tail of species abundance distributions (Haegeman et al., 2013). Coverage estimators, such as Good’s coverage, are also used to assess the sufficiency of sequence number (Martín and Goldenfeld, 2006). However, the coverage estimators only provide percentages and do not reveal any information of microbial community structure. In addition, because microbial communities contain massive amounts of rare species, the coverage estimators commonly reach a plateau phase and cannot further increase with the sampling efforts, especially for the data from high-throughput sequencing. Zhou et al. (2013) developed a theoretical framework to predict sampling efforts for achieving a desired reproducibility that is assessed by overlap. It is very useful for researchers to predict sampling efforts. However, the internal structure differences between microbial communities are still unclear after using this method. 3

ACCEPTED MANUSCRIPT Therefore, in the present study, we developed a novel theoretical framework to determine the representativeness of a sampling dataset and to predict the sampling depth required to achieve a desired purpose. In this approach, the sampling was conducted at different depths and the correlation between dissimilarities of the sampling datasets and sampling depths were fitted. In addition, we verified this method by empirical microbial assemblages.

AC

CE

PT E

D

MA

NU

SC

RI

PT

2 Materials and Methods 2.1 Theoretical model The vastness of microbial diversity implies that it is impractical to enumerate all individuals to accurately describe such communities. Sampling a moderate number of individuals to represent communities is a common strategy of microbial ecologists. The representativeness of the sampling dataset increases with the sampling depth (D). Therefore, for a community, the average dissimilarity of sampling datasets at the same sampling depth is smaller and increases with the sampling depth (Figure 1). For the dissimilarity values that are calculated on the basis of OTU abundance datasets, the values were easily influenced by the dominant OTUs. In addition, the relative abundance of the dominant OTUs made it easier to achieve stability with sampling depth. Reduction of the dissimilarity values is logarithmically related to the increase in sampling depth. Therefore, we derived a regression equation as follows, (1) d  a  log10 D  b where variable D is the sampling depth, variable d is the average dissimilarity between different sampling microbial communities at a specific sampling depth ranging from 0 to 1, and parameters a and b are constants that depend on the microbial community structure.

Figure 1 Theoretical model of dissimilarity caused by sampling. A, B, and C denote the microbial community samples with different sampling depths. D denotes entirety. d and d′ denotes the dissimilarity between microbial community samples at the same sampling depth, and the dissimilarity between microbial community samples and entirety, respectively. With replacement, the expected sampling depth is given by the following equation: D  10( d b) / a (2) The average dissimilarity becomes zero when the sampling depth is large enough to represent an actual community. When d is 0, which means there is no dissimilarity between different sampling datasets, structure of the sampling datasets could represent the structure of the actual community. Therefore, PS  10b / a (3) 4

ACCEPTED MANUSCRIPT where PS is the predicted number of sequences. As the Bray-Curtis dissimilarity (dBC) is widely used to measure the dissimilarity between microbial communities in microbial ecology (Martín et al., 2006; Ge et al., 2013; Zhou et al., 2014; Deng et al., 2016), we chose dBC as the dissimilarity index in the present study.

SC

RI

PT

2.2 Procedure of the theoretical framework in predicting the sampling depth To verify the regression equation between D and dBC, a public dataset (NDMS024, accession number 4571971.3) of ocean water microbial community (Techtmann et al., 2015) was retrieved from MG-RAST (http://metagenomics.anl.gov/). Sequences were pre-processed via QIIME pipeline version 1.9.0 (Caporaso et al., 2010) as previously described (Li et al., 2016). The high-quality sequences were clustered into operational taxonomic units (OTUs) at 97% identity threshold. Random sampling was conducted by QIIME Pipeline Version 1.9.0. The calculation of Bray-Curtis dissimilarity was performed using the vegan package in R version 3.0.1 (Dixon, 2003).

AC

CE

PT E

D

MA

NU

The sequence dataset was treated as entirety. Firstly, three datasets with 10000 sequences were sampled, which simulated the environmental sampling process. Secondly, from each dataset, five subsets with 1000, 3000, 5000, 7000, and 9000 sequences, respectively, were re-sampled. The sampling process was similar to the simulation with an analogous example described by Zhou et al. (2013). Thirdly, the dBC between subsets were calculated with the same sequences and fitted the correction equation. Finally, the sequence numbers of the public dataset NDMS024 were predicted and compared with the true value. A sample pipeline of the method was showed in Figure 2. An R script for running the method was supported by Supplemental file 1.

5

PT E

D

MA

NU

SC

RI

PT

ACCEPTED MANUSCRIPT

Figure 2 Pipeline of the method. NGS, next-generation sequencing; QC, quality control; dBC, Bray-Curtis dissimilarity; D, sampling depth.

AC

CE

2.3 Dataset achieving and analysis To verify the practicability of the theoretical framework in actual microbial community, 3 replicates of sediment samples were collected from a laboratory mesocosm. The mesocosm is a cylindrical container with 30 cm diameter and 60 cm of height. There is a 40 cm of sediment and 10 cm of overlying water in the mesocosm. The mesocosm is used to research the biodegradability of sediment to persistent organic pollutants. Microbial DNAs were extracted according to Fang et al. (2015) and the V4-V5 hypervariable region of 16S rRNA gene was sequenced by an Illumina MiSeq system and pre-processed by QIIME pipeline version 1.9.0 (Caporaso et al., 2010) as previously described (Li et al., 2016). OTU clustering, random sampling, and Bray-Curtis dissimilarity was calculated as described above with the exception of the random sampling depths that were changed to 3000, 6000, 9000, 12000, 15000, and 18000 sequences. In addition, 3 replicates of fecal samples were collected from mouse and analyzed as the above description. All DNA datasets have been submitted to the NCBI Sequence Read Archive database under the accession number SRP090150 (for sediment samples) and SRP107826 (for fecal samples). 3 Results and Discussion 6

ACCEPTED MANUSCRIPT

MA

NU

SC

RI

PT

3.1 Predicted result of the public dataset As shown in Figure 3, the dBC was logarithmically related to the sampling depth. In addition, we also considered a power function (dBC = 0.224e−7D, p < 0.001, R2 = 0.888) and a linear function (dBC = −10−5D + 0.220, p < 0.001, R2 = 0.831) between the dBC and the sampling depth. According to the probability (p) value of the significance test and the regression coefficient R2, the logarithmic function was the best one. According to the correlation equation, dBC = −0.1022·log10(D) + 0.5307 (p < 0.001, R2 = 0.94, Figure 3), and equation (3), PS was predicted to be 155,868, which was very close to the true value, 176,322. The bias was probably caused by the biases of the parameters a and b in the correlation equation and the exponential amplification effect of the equation (3).

Figure 3 Correlation between Bray-Curtis dissimilarity (dBC) of the sampling datasets of a simulated microbial community and the sampling depths (D).

AC

CE

PT E

D

3.2 Taxa retrieval and reliability of their relative abundances with sampling depth As rare taxa have attracted more attention recently (Campbell et al., 2011; Coveley et al., 2015; Hua et al., 2015), their assessment is necessary for us to detect such rare taxa at a certain degree. An overlap among/between replicates (e.g., technical replications) is commonly used to indicate the probability of OTU detection (Zhou et al., 2011; Zhou et al., 2015). Therefore, we used overlap as an indicator to assess the probability of detected OTUs with different relative abundances. The results indicated that OTUs with more than 1% of relative abundance could be detected by sequencing approximately 3.4% sequences of entirety. For the OTUs with relative abundances between 1% and 0.1%, sequencing more than 6.8% sequences of entirety could detect them. In addition, for the OTUs with relative abundance between 0.1% and 0.01%, sequencing depth of more than 10.2% still did not ensure their detection (Figure 4(A)–4(C)). Despite the importance of taxon composition in microbial communities, the relative abundance of each taxon was another important aspect to reveal the structure of microbial communities. Therefore, we also detected which sampling depths could reveal the real relative abundances of each taxon using the public dataset NDMS024. The results indicated that for taxa with more than 1% relative abundance, the sampling depth of 3000 sequences (approximately 3.4% of entirety) could reveal the relative abundances of those taxa. In this case, the average dBC was 0.2263. However, for the taxa with 1%–0.1% relative abundance, even though the sampling depth 7

ACCEPTED MANUSCRIPT

SC

RI

PT

reached 18,000 sequences (approximately 10.2% of entirety), it still could not reveal the real relative abundances of those taxa (Figure 4(D)-4(F)). These results imply that it is hard to describe the relative abundance of rare taxa in a microbial community at the current sequencing depth, i.e., tens of thousands sequences.

CE

PT E

D

MA

NU

Figure 4 Illustration of random sampling processes and their impacts on the analysis of microbial communities. (A), (B), and (C) showed the relationship between the sequence percentage of entirety and the overlap. The increase in taxon overlap along with sampling sequence depth are shown for the taxa with more than 1% (A), 1%–0.1% (B), and 0.1%–0.01% (C) relative abundance. The 100% overlap implies that all OTUs are enumerated at the corresponding sequence percentage of entirety. The datasets were from a laboratory mesocosm and three replicates of 72,933; 26,561; and 19,065 high-quality sequences were obtained. The standard deviations of relative abundances for each taxon among sampling processes are shown with different sampling depths for the taxa with more than 1% (D), 1%–0.1% (E), and 0.1%–0.01% (F) relative abundance. D3000, D6000, D9000, D12000, D15000, and D18000 indicate the sequencing depths were 3000, 6000, 9000, 12000, 15000, and 18000, respectively. The larger the standard deviation, the smaller is the reliability of the OTU relative abundance.

AC

3.3 Predicting the minimum number of sequences required for studying environmental microbial communities In order to verify the utility of our method in determining the representative number of sequences for environmental microbial communities, we predicted the minimum number of sequences of the community from a laboratory mesocosm. Three replicates of 72,933; 26,561; and 19,065 high-quality sequences were sequenced. According to the correlation equation between re-sampling depths and dBC (Figure 5), the minimum number of sequences required to reasonably reflect the composition and structure of the microbial communities was calculated. According to the prediction, the sequencing depth should at least reach 5,528,079 to be representative of the microbial communities from the mesocosms. However, if the composition of rare taxa with more than 1% and 1%–0.1% of relative abundances was considered, 3,000 and 12,000 sequences, respectively, were enough, which is in accordance with a previous study (Caporaso et al., 2011).

8

ACCEPTED MANUSCRIPT

PT

Figure 5 Correlation between Bray-Curtis dissimilarity (dBC) of the sampling datasets of a mesocosm microbial community and the sampling depths (D).

MA

NU

SC

RI

Three fecal samples from three mice (sample names were MN20, MN23, and MN26) were also analyzed to verify the utility of our method in determining the representative number of sequences for environmental microbial communities. Three replicates of 22,993; 24,759; and 38,032 high-quality sequences were sequenced. According to the prediction, the sequencing depth should at least reach 6,898,848 to be representative of the fecal microbiota from the individual mouse (Figure S1(A) and (B)). This was very close to the prediction of the microbiota from the mesocosms (5,528,079). However, the sequencing depth should at least reach 9.82 × 1023 to be exhaustivity of the fecal microbiota from the three mice and two predictions were very close (Figure S1(C) and (D)). This result implied that it was impossibility to exhaustivity of the fecal microbiota from different individuals of the same mammal species as tremendous diversity between different individuals.

AC

CE

PT E

D

If the correlation between a specific alpha-diversity index (such as Shannon’s index or Simpson’s index) of microbiota in a specific habitat and the minimum number of sequences required to reasonably reflect the composition and structure of the microbiota were known, then the alpha-diversity index could be applied to calculate the minimum number of sequences. However, although according to this method, the correlation equations between an alpha-diversity index of microbiota and the minimum number of sequences were fitted theoretically, the suitable microbial community datasets were too small to fit the correlation equations. A prerequisite using the method to calculate the minimum number of sequences that is representative of a microbiota was to sequence at least three replicates. However, until now, very few microbiota have been sequenced for at least three replicates synchronously. Therefore, to more accurately assess the composition and structure of a microbiota, we proposed that at least three replicates are sequenced at relatively low depth (approximately 10,000 sequences) synchronously. Then, the minimum number of sequences required to be reasonably representative of the microbiota for different research purposes be calculated using our method. When the suitable datasets were accumulated, the correlation equations between an alpha-diversity index of microbiota and the minimum number of sequences could be fitted and we could apply the alpha-diversity index to calculate the minimum number of sequences. It is not necessary to calculate the minimum number of sequences of all samples from the same biotope. We recommended choosing some representative samples to calculate the minimum number of sequences. In addition, the minimum number of sequences required to satisfy a specific purpose 9

ACCEPTED MANUSCRIPT could also be calculated according to our method by choosing the corresponding dataset. For instance, if we would like to analyze the structure of dominant OTUs in microbiota, the dBC should be calculated using the dominant OTU dataset.

PT

Technical errors should be seriously considered when using this method to determine the minimum number of sequences. Since most microbiota structures were analyzed via 16S rRNA gene high-throughput sequencing technology, the technical errors, such as those caused by PCR amplification and misrecognition, blocked the analysis of rare OTUs (Quince et al., 2011). Other factors, such as the 16S rRNA gene copy number, also interfered with the analysis of rare OTUs (Kembel et al., 2012; Ni et al., 2016).

SC

RI

In conclusion, we developed a novel method that determines the representativeness of a sampling dataset and predicts the minimum number of sequences required to satisfy a specific purpose. This method provides a criterion to ensure sequencing sufficiency when conducting research that involves the microbial community structure and can help researchers to ensure reliability of microbial community data, as well as to make better decisions and allocate resources more appropriately in microbial ecology studies.

PT E

D

MA

NU

Acknowledgements We would like to thank Ye Deng and Yueni Wu from the Research Center for Eco-Environmental Sciences, Chinese Academy of Sciences for their help to write the R script. This research was supported by the National Natural Science Foundation of China (No. 31500417), the China Postdoctoral Science Foundation (No. 2017M612691), the Guangdong Natural Science Foundation (No. 2014A030310281; 2014A030308019), the Youth Science Foundation of Guangdong Academy of Sciences (No. qnjj201504), the National Science Foundation for Excellent Young Scholars of China (No. 51422803), and the Special Fund for Application Research of Guangdong Province (No. 2015B020235011). Conflict of Interest Statement The authors declare no competing financial interests.

AC

CE

Author Contributions JN conceived the idea. JN, XL, and MX designed the study. JN and XL conducted the experiments, acquired and analyzed the data. JN, MX, and ZH were involved in drafting the manuscript and all authors approved the final version to be published. References Bartram, A.K., Lynch, M.D.J., Stearns, J.C., Moreno-Hagelsieb, G., and Neufeld, J.D. (2011) Generation of multimillion-sequence 16S rRNA gene libraries from complex microbial communities by assembling paired-end Illumina reads. Appl. Environ. Microbiol. 77, 3846-3852. doi:10.1128/AEM.02772-10 Bik, E.M., Long, C.D., Armitage, G.C., Loomer, P., Emerson, J., Mongodin, E.F., et al. (2010) Bacterial diversity in the oral cavity of 10 healthy individuals. ISME J. 4, 962-974. doi:10.1038/ismej.2010.30 Brose, U., Martinez, N.D., and Williams, R.J. (2003) Estimating species richness: sensitivity to sample coverage and insensitivity to spatial patterns. Ecology 84, 2364-2377. Campbell, B.J., Yu, L., Heidelberg, J.F., and Kirchman, D.L. (2011) Activity of 10

ACCEPTED MANUSCRIPT

AC

CE

PT E

D

MA

NU

SC

RI

PT

abundant and rare bacteria in a coastal ocean. Proc. Natl. Acad. Sci. USA 108, 12776-12781. doi:10.1073/pnas.1101405108 Caporaso, J.G., Kuczynski, J., Stombaugh, J., Bittinger, K., Bushman, F.D., Costello, E.K., et al. (2010) QIIME allows analysis of high-throughput community sequencing data. Nat. Methods7, 335-336. doi:10.1038/nmeth.f.303 Caporaso, J.G., Lauber, C.L., Walters, W.A., Berg-Lyons, D., Lozupone, C.A., Turnbaugh, P.J., et al. (2011) Global patterns of 16S rRNA diversity at a depth of millions of sequences per sample. Proc. Natl. Acad. Sci. USA 108, 4516-4522. doi:10.1073/pnas.1000080107 Caporaso, J.G., Lauber, C.L., Walters, W.A., Berg-Lyons, D., Huntley, J., Fierer, N., et al. (2012) Ultra-high-throughput microbial community analysis on the Illumina HiSeq and MiSeq platforms. ISME J. 6, 1621-1624. doi:10.1038/ismej.2012.8 Coveley, S., Elshahed, M.S., and Youssef, N.H. (2015) Response of the rare biosphere to environmental disturbance in a highly diverse ecosystem (Zodletone spring, OK, USA). PeerJ 3:e1182. doi:10.7717/peerj.1182 Curtis, T.P., Sloan, W.T. and Scannell, J.W. (2002) Estimating prokaryotic diversity and its limits. Proc. Natl. Acad. Sci. USA 99, 10494-10499. Daniel, R. (2005) The metagenomics of soil. Nat. Rev. Microbiol. 3, 470-478. Deng, C., Daley, T.,and Smith, A. (2015) Applications of species accumulation curves in large-scale biological data analysis. Quant. Biol. 3, 135-144. doi:10.1007/s40484-015-0049-7 Deng, Y., He, Z., Xiong, J., Yu, H., Xu, M., Hobbie, S.E., et al. (2016) Elevated carbon dioxide accelerates the spatial turnover of soil microbial communities. Glob. Chang. Biol. 22, 957-964. doi:10.1111/gcb.13098 Dixon, P. (2003) VEGAN, a package of R functions for community ecology. J. Veg. Sci. 14, 927-930. Dove, A.D.M. and Cribb, T.H. (2006) Species accumulation curves and their applications in parasite ecology. Trends Parasitol. 22, 568-574. Fang, Y., Xu, M., Chen, X., Sun, G., Guo, J., Wu, W., et al. (2015) Modified pretreatment method for total microbial DNA extraction from contaminated river sediment. Front. Environ. Sci. Eng. 9,444-452. doi:10.1007/s11783-014-0679-4 Gans, J., Wolinsky, M., and Dunbar, J. (2005) Computational improvements reveal great bacterial diversity and high metal toxicity in soil. Science 309, 1387-1390. Ge, Y., Schimel, J.P., and Holden, P.A. (2014) Analysis of run-to-run variation of bar-coded pyrosequencing for evaluating bacterial community shifts and individual taxa dynamics. PLoS ONE 9:e99414. doi:10.1371/journal.pone.0099414 Gotelli, N.J. and Colwell, R.K. (2001) Quantifying biodiversity: procedures and pitfalls in the measurement and comparison of species richness. Ecol. Lett. 4, 379-391. Haegeman, B., Hamelin, J., Moriarty, J., Neal, P., Dushoff, J., and Weitz, J.S.(2013) Robust estimation of microbial diversity in theory and in practice. ISME J. 7, 1092-1101. doi:10.1038/ismej.2013.10 Harris, J.K., Sahl, J.W., Castoe, T.A., Wagner, B.D., Pollock, D.D., and Spear, J.R. (2010) Comparison of normalization methods for construction of large, multiplex amplicon pools for next-generation sequencing. Appl. Environ. Microbiol. 76, 3863-3868. doi:10.1128/AEM.02585-09 Horner-Devine, M.C., Carney, K.M., and Bohannan, B.J.M. (2004) An ecological perspective on bacterial biodiversity. Proc. Biol. Sci. 271, 113-122. Hortal, J., Borges, P.A.V., and Gaspar, C. (2006) Evaluating the performance of species richness estimators: sensitivity to sample grain size. J. Anim. Ecol. 75, 11

ACCEPTED MANUSCRIPT

AC

CE

PT E

D

MA

NU

SC

RI

PT

274-287. Hua, Z.S., Han, Y.J., Chen, L.X., Liu, J., Hu, M., Li, S.J., et al. (2015) Ecological roles of dominant and rare prokaryotes in acid mine drainage revealed by metagenomics and metatranscriptomics. ISME J. 9, 1280-1294. doi:10.1038/ismej.2014.212 Hughes, J.B., Hellmann, J.J., Ricketts, T.H., and Bohannan, B.J.M. (2001) Counting the uncountable: Statistical approaches to estimating microbial diversity. Appl. Environ. Microbiol. 67, 4399-4406. Huse, S.M., Dethlefsen, L., Huber, J.A., Mark Welch, D., Relman, D.A., and Sogin, M.L. (2008) Exploring microbial diversity and taxonomy using SSU rRNA hypervariable tag sequencing. PLoS Genet. 4:e1000255. doi:10.1371/journal.pgen.1000255 Kembel, S.W., Wu, M., Eisen, J.A., and Green, J.L. (2012) Incorporating 16S gene copy number information improves estimates of microbial diversity and abundance. PLoS Comput. Biol. 8: e1002743. doi:10.1371/journal.pcbi.1002743 Li, H., Qu, J., Li, T., Li, J., Lin, Q., and Li, X. (2016) Pika population density is associated with the composition and diversity of gut microbiota. Front. Microbiol. 7: 758. doi: 10.3389/fmicb.2016.00758 Li, J., Ni, J., Li, J., Wang, C., Li, X., Wu, S., et al. (2014) Comparative study on gastrointestinal microbiota of eight fish species with different feeding habits. J. Appl. Microbiol. 117, 1750-1760. doi:10.1111/jam.12663 Lundin, D., Severin, I., Logue, J.B., Östman, Ö., Andersson, A.F., and Lindström, E.S. (2012) Which sequencing depth is sufficient to describe patterns in bacterial α- and β-diversity? Environ. Microbiol. Rep. 4, 367-372. doi:10.1111/j.1758-2229.2012.00345.x Mao, C.X., Colwell, R.K., and Chang, J. (2005) Estimating the species accumulation curve using mixtures. Biometrics 61, 433-441. Martín, H.G. and Goldenfeld, N. (2006) Estimation of microbial cover distributions at Mammoth Hot Springs using a multiple clone library resampling method. Environ. Microbiol. 8, 1145-1154. Melo, A.S. and Froehlich, C.G. (2001) Evaluation of methods for estimating macroinvertebrate species richness using individual stones in tropical streams. Freshw. Biol. 46, 711-721. Metcalf, J.L., Xu, Z.Z., Weiss, S., Lax, S., Van Treuren, W., Hyde, E.R., et al. (2016) Microbial community assembly and metabolic function during mammalian corpse decomposition. Science 351, 158-162. doi:10.1126/science.aad2646 Møller, A.K., Søborg, D.A., Al-Soud, W.A., Sørensen, S.J., and Kroer, N. (2013) Bacterial community structure in High-Arctic snow and freshwater as revealed by pyrosequencing of 16S rRNA genes and cultivation. Polar Res. 32:17390. doi:10.3402/polar.v32i0.17390 Mou, X., Jacob, J., Lu, X., Robbins, S., Sun, S., Ortiz, J.D., et al. (2013) Diversity and distribution of free-living and particle-associated bacterioplankton in Sandusky Bay and adjacent waters of Lake Erie Western Basin. J. Gt. Lakes Res. 39, 352-357. doi:10.1016/j.jglr.2013.03.014 Ni, J., Xu, M., He, Z., Sun, G., and Guo, J. (2016) Novel insight into evolutionary process from average genome size marine bacterioplanktonic biota. Appl. Ecol. Environ. Res. 14, 65-75. doi:10.15666/aeer/1402_065075 Quince, C., Lanzen, A., Davenport, R.J., and Turnbaugh, P.J. (2011) Removing noise from pyrosequenced amplicons. BMC Bioinformatics 12: 38. doi:1186/1471-2105-12-38 12

ACCEPTED MANUSCRIPT

AC

CE

PT E

D

MA

NU

SC

RI

PT

Roesch, L.F.W., Fulthorpe, R.R., Riva, A., Casella, G., Hadwin, A.K., Kent, A.D., et al. (2007) Pyrosequencing enumerates and contrasts soil microbial diversity. ISME J. 1, 283-290. Techtmann, S.M., Fortney, J.L., Ayers, K.A., Joyner, D.C., Linley, T.D., Pfiffner, S.M., et al. (2015) The unique chemistry of Eastern Mediterranean water masses selects for distinct microbial communities by depth. PLoS ONE 10: e0120605.doi:10.1371/journal.pone.0120605 Tremblay, J., Singh, K., Fern, A., Kirton, E.S., He, S., Woyke, T., et al. (2015) Primer and platform effects on 16S rRNA tag sequencing. Front. Microbiol. 6:771. doi:10.3389/fmicb.2015.00771 Torsvik, V., Øvreås, L., and Thingstad, T.F. (2002) Prokaryotic diversity - magnitude, dynamics, and controlling factors. Science 296, 1064-1066. Walther, B.A. and Morand, S. (1998) Comparative performance of species richness estimation methods. Parasitology 116, 395-405. Youssef, N., Sheik, C.S., Krumholz, L.R., Najar, F.Z., Roe, B.A., and Elshahed, M.S. (2009) Comparison of species richness estimates obtained using nearly complete fragments and simulated pyrosequencing-generated fragments in 16S rRNA gene-based environmental surveys. Appl. Environ. Microbiol. 75, 5227-5236. doi:10.1128/AEM.00592-09 Zhou, J., Deng, Y., Zhang, P., Xue, K., Liang, Y., Van Nostrand J.D., et al. (2014) Stochasticity, succession, and environmental perturbations in a fluidic ecosystem. Proc. Natl. Acad. Sci. USA 111, E836-E845. doi:10.1073/pnas.1324044111 Zhou, J., He, Z., Yang, Y., Deng, Y., Tringe, S.G., and Alvarez-Cohen, L. (2015) High-throughput metagenomic technologies for complex microbial community analysis: open and closed formats. mBio 6:e02288-14. doi:10.1128/mBio.02288-14 Zhou, J., Jiang, Y.H., Deng, Y., Shi, Z., Zhou, B.Y., Xue, K., et al. (2013) Random sampling process leads to overestimation of β-diversity of microbial communities. mBio 4, e00324-13. doi:10.1128/mBio.00324-13 Zhou, J., Wu, L., Deng, Y., Zhi, X., Jiang, Y.H., Tu, Q., et al. (2011) Reproducibility and quantitation of amplicon sequencing-based detection. ISME J. 5, 1303-1313. doi:10.1038/ismej.2011.11

13

ACCEPTED MANUSCRIPT

AC

CE

PT E

D

MA

NU

SC

RI

PT

Graphical abstract

14

ACCEPTED MANUSCRIPT HIGHLIGHTS

Minimum sequencing depth to obtain a reliable description of microbial community could be determined. The method was based on the correlation equation between dissimilarity of resampling datasets and resampling depths.

AC

CE

PT E

D

MA

NU

SC

RI

PT

The method provides a criterion to ensure sequencing sufficiency of microbial communities.

15