CHAPTER 11
Metagenomics
Isabel Moreno-Indias1,2, Francisco J. Tinahones1,2
1Unidad de Gestión Clínica de Endocrinología y Nutrición, Laboratorio del Instituto de Investigación Biomédica de Málaga (IBIMA), Hospital Universitario de Málaga (Virgen de la Victoria), Málaga, Spain; 2Centro de Investigación Biomédica en Red de Fisiopatología de la Obesidad y la Nutrición (CIBEROBN), Madrid, Spain
BACKGROUND
Each human intestine harbors trillions of bacteria as well as other microorganisms such as bacteriophages, viruses, fungi, and archaea, which constitute a complex and dynamic ecosystem living in symbiosis with humans throughout our lifetime. This ecosystem is referred to as the gut microbiota, although the term "gut microbiome" is often used interchangeably. Bacteria represent almost 99% of this gut microbiota, so most studies focus on these microorganisms. The gut microbiota is related to the promotion of health and to the initiation and maintenance of different gastrointestinal and non-gastrointestinal diseases. Changes in the composition of the human gut microbiota are associated with the development of chronic diseases including type 2 diabetes, obesity, and colorectal cancer, among others (Young, 2017). The gut microbiota, encompassing trillions of bacteria, is considered a microbial organ with a key role within the human organism. Members of the microbiota communicate among themselves and exchange signals with the host's systems, participating in body homeostasis. The gut microbiota takes part in multiple functions within the body: metabolic functions (synthesizing vitamins, digesting compounds, fermenting nondigestible substances, and producing energy) and protective functions (stimulating the immune system, exerting antimicrobial effects, and creating a physical barrier against pathogenic bacteria), among others. Along this line, study of the gut microbiome community has led to important research and renewed scientific efforts to understand the whole functionality of the human body.
The importance of the gut microbiota was hypothesized a long time ago, but advances in the technology used to identify and analyze the population of microorganisms were necessary. Traditionally, bacteria were studied by culture-based methods, but the difficulty of culturing most anaerobic commensal gut microorganisms led to the development of culture-independent methods. The rapid development and, most important, the substantial cost reduction of next-generation sequencing (NGS), or high-throughput sequencing, have revolutionized the field of microbial ecology and led to the establishment of the field of metagenomics. Metagenomics refers to the direct genetic analysis of the genomes contained within a sample: that is, the functional and sequence-based analysis of the collective genomes of a sample. However, many researchers working on the microbiota have extended this general term to the study of certain genes of interest, such as the 16S ribosomal RNA gene. The powerful combination of genome sequencing and bioinformatics-driven analysis of sequence data has transformed our knowledge about how bacteria function, evolve, and interact with each other, with their hosts, and with the environment, providing new avenues of inquiry and advances with translational impact.
Principles of Nutrigenetics and Nutrigenomics https://doi.org/10.1016/B978-0-12-804572-5.00011-2
HISTORY OF METHODS USED
Advances in technology have primarily driven the current revolution in microbiology, particularly in the study of the gut microbiota. The development of culture-independent methods has given the scientific community the chance to address new hypotheses and paradigms. Once culture methods were no longer a constraint, molecular methods burst onto the scene, helped by the design of universal 16S ribosomal RNA gene
Copyright © 2020 Elsevier Inc. All rights reserved.
polymerase chain reaction (PCR) primers for the analysis of microbes. Thus, the use of quantitative PCR and the development of arrays with these primers kicked off microbiota studies. When the semiquantitative technique of denaturing gradient gel electrophoresis was shown to be an effective method for comparing microbiota compositions, it became the reference standard for the rapid initial screening of bacterial communities of initially unknown composition before NGS methods were fully developed. High-throughput sequencing, or NGS, reached bacteriology in the first decade of the 2000s, leading to multiple platforms for scientists (Kozińska et al., 2019), accompanied by new bioinformatics approaches. Briefly, the first of the technologies that revolutionized genomics and metagenomics was pyrosequencing, led by the 454 sequencing platform from Roche. The principle of this technology was simple: nucleotides are added one at a time in cycles, and the pyrophosphate released by the DNA polymerization reaction is converted into a light signal that the machine detects and translates into nucleotide sequences. However, despite the revolutionary nature of this technology for metagenomics, it is now obsolete. An analogous technology is the Ion Torrent platform. The Ion Torrent Personal Genome Machine and the Ion S5 (from Thermo Fisher Scientific) detect the change in pH produced each time a proton is released when a nucleotide is added in the sequencing reaction. Reducing read length is another approach that some platforms use to lower sequencing costs. This is the case of Illumina technology (Illumina, Inc.), which has become one of the most popular technologies owing to its low cost and high yield. The basis of Illumina chemistry is reversible-termination sequencing by synthesis with fluorescently labeled nucleotides.
DNA fragments are attached and distributed in a flow cell, in which the sequencing reaction occurs by adding labeled nucleotides. When a labeled nucleotide is incorporated into a growing strand, a laser excites its fluorescent molecule and the machine registers the signal. Although these technologies are the most commonly used for metagenome projects, the development of sequencing has continued in order to address the known biases of these strategies and to offer a better trade-off among yield, cost, and read length. This is the case of other platforms such as PacBio (developed by Pacific Biosciences). PacBio is based on single-molecule real-time (SMRT) sequencing and is classified as third-generation sequencing. SMRT sequencing is built on two important innovations: zero-mode waveguides, which allow light to illuminate only the bottom of a well in which a DNA polymerase-template complex is immobilized, and phospholinked nucleotides, which enable an examination of the immobilized complex as the DNA
polymerase produces a completely natural DNA strand. PacBio sequencing offers much longer read lengths and faster runs than the other methods, but it is hindered by a lower throughput, higher error rate, and higher cost per base. Finally, the most recently developed technology is nanopore sequencing (Oxford Nanopore Technologies), with the portable MinION device, which provides long reads and fits in a pocket. Nanopore technology is based on nanoscale holes: a strand of DNA is passed through the nanopore, and the electrical current changes as the bases pass through the pore in different combinations. Data obtained from second- or third-generation sequencing technologies have certain computational requirements for their analysis. The larger the dataset generated, the greater the computational resources and the more complex the bioinformatics analyses required. Moreover, the strengths and weaknesses of second- and third-generation sequencing are complementary, which motivates researchers to integrate both techniques: high-throughput, high-accuracy short-read data from second-generation sequencing are used to correct errors in the long reads of third-generation sequencing, which reduces the required amount of costly long-read sequence data and salvages the relatively long but more error-prone subreads.
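The flow-based principle described above for pyrosequencing and Ion Torrent (one nucleotide species offered per cycle, one signal recorded per cycle) can be illustrated with a minimal sketch. It rests on assumptions that are not in the chapter: the flow order `TACG`, the idea that the signal is roughly proportional to the number of identical bases incorporated, and the example intensities are all illustrative, and the function name `decode_flowgram` is hypothetical rather than part of any vendor software.

```python
def decode_flowgram(signals, flow_order="TACG"):
    """Translate per-flow signal intensities into a nucleotide sequence.

    Assumption: a signal close to n means that n identical bases were
    incorporated during that flow (a homopolymer of length n).
    """
    bases = []
    for i, signal in enumerate(signals):
        nucleotide = flow_order[i % len(flow_order)]  # nucleotide offered in this flow
        count = int(signal + 0.5)                     # round to the nearest whole number
        bases.append(nucleotide * count)
    return "".join(bases)


# Illustrative (made-up) intensities for eight consecutive flows: T, A, C, G, T, A, C, G
flows = [1.02, 0.05, 0.96, 2.10, 0.03, 1.08, 0.00, 0.94]
print(decode_flowgram(flows))  # TCGGAG
```

Because the homopolymer length is inferred from signal strength, noise in long runs of the same base is a plausible source of insertion and deletion errors in flow-based platforms.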
METAGENOMIC APPROACHES
Experimental design is the most important part of a scientific study; it should fit each project's objectives in order to answer the biological question posed. Different approaches can yield contrasting results, so it is important to choose the approach best suited to the question. Two metagenomic approaches for studying the microbiota are introduced here, classified according to the target of sequencing.
Targeted Methods
The sequencing of specific microbial amplicons, mostly the bacterial 16S ribosomal RNA gene, has been extensively used for gut microbiota studies. Indeed, it is the technique that has most contributed to the establishment of the science of gut microbiota. Although this amplicon-based sequencing considers only one microbial gene, it is frequently grouped under the umbrella of metagenomics as a way to perform taxonomic, phylogenetic, or functional profiling. Amplicon sequences can be directly matched to reference taxa or, more frequently, first grouped into clusters sharing a fixed level of sequence identity (often 97% or 99%) and known as operational taxonomic units
I. THE BIOLOGICAL BASIS OF HERITABILITY AND DIVERSITY
(OTUs). These OTUs are often more specific than genera, but in most cases they are less specific than species. Newer methods control sequencing errors well enough that amplicons can be resolved according to single-nucleotide differences over the sequenced gene region without constructing OTUs, which gives better resolution in the subsequent taxonomic assignment. The resulting units are known as amplicon sequence variants (ASVs). In either case, individual reads, OTUs, or ASVs are then assigned to specific taxa based on sequence homology to a reference genomic sequence, a process known as binning. The main advantage of this approach is the availability, for the 16S ribosomal RNA (rRNA) gene, of several large databases of reference sequences and taxonomies, such as Greengenes, SILVA, and the Ribosomal Database Project (RDP). However, species-level resolution might be unfeasible using this information alone; in most cases the family or genus level is the finest taxonomic level achieved. Targeted approaches such as this were first restricted to taxonomic and phylogenetic profiling. However, the development of bioinformatics software has added functional profiling to this approach. The best-known method is Phylogenetic Investigation of Communities by Reconstruction of Unobserved States (PICRUSt), a bioinformatics software package designed to predict the functional content of a metagenome from targeted amplicon surveys such as those of the 16S rRNA gene (although it may also be used for full genomes). Moreover, strategies that refine the results can mitigate the disadvantages of this technique. Besides the ASVs already mentioned, strategies such as oligotyping have been proposed to improve OTU resolution, using a sequence entropy-based approach to identify maximally informative sites within the 16S rRNA gene.
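The idea of grouping amplicons into OTUs at a fixed identity threshold can be sketched as follows. This is a deliberately simplified illustration, not the algorithm of any particular tool: it assumes reads are already aligned and of equal length, it uses a greedy "first matching centroid" rule, and the helper `mutate` and the toy reads are hypothetical. Production pipelines typically use proper pairwise alignment and more careful ordering of reads.

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length, pre-aligned reads."""
    if len(a) != len(b):
        raise ValueError("reads must be aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cluster_otus(reads, threshold=0.97):
    """Greedy centroid clustering: a read joins the first OTU whose centroid
    it matches at or above the threshold; otherwise it founds a new OTU."""
    centroids, members = [], []
    for read in reads:
        for idx, centroid in enumerate(centroids):
            if identity(read, centroid) >= threshold:
                members[idx].append(read)
                break
        else:  # no centroid was similar enough
            centroids.append(read)
            members.append([read])
    return members


def mutate(seq, positions):
    """Hypothetical helper: substitute the base at each given position."""
    swap = {"A": "C", "C": "G", "G": "T", "T": "A"}
    chars = list(seq)
    for p in positions:
        chars[p] = swap[chars[p]]
    return "".join(chars)


reference = "ACGT" * 25                                   # toy 100-nt amplicon
reads = [
    reference,
    mutate(reference, [3, 50]),                           # 98% identical to the reference
    mutate(reference, [1, 10, 20, 30, 40]),               # 95% identical to the reference
]
print([len(otu) for otu in cluster_otus(reads)])          # [2, 1]
print([len(otu) for otu in cluster_otus(reads, 0.99)])    # [1, 1, 1]
```

Raising the threshold from 97% to 99% splits the same reads into more, finer units, which is the direction that ASV-style methods take to its limit of single-nucleotide resolution.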
Untargeted Methods
Despite advances in amplicon sequencing, whole-genome shotgun (WGS) sequencing is the preferred method for studying microbial communities. In contrast to amplicon sequencing, shotgun sequencing methods are applied to millions of random genomic fragments sampled from a microbial community. Assembly is the process of combining sequence reads into contiguous fragments of DNA called contigs, based on sequence similarity between reads. The presence of paired reads in two different contigs then allows those contigs to be linked into a larger noncontiguous DNA sequence called a scaffold. The goal of WGS is to represent each genomic sequence in one scaffold, although this is not always possible. Because assembly does not reconstruct entire genomes, it is necessary to bin genome fragments. WGS is able to profile the bacterial community even to the strain level, owing to its ability to identify variation throughout microbial genomes. The genomic content of a community describes its functional potential, because the genes contained in each member are known. Advantages of WGS sequencing are its ability to provide information on genome assembly, species-level taxonomic abundance, gene prediction, and metabolic pathway reconstruction. However, each stage of the analysis is complicated by incomplete coverage, the high volume of data, the short length of reads, and intrinsic errors caused by massively parallel sequencing. The data obtained reveal which organisms are present in the community (taxonomic profiling), as the 16S approach does although with better accuracy, as well as the profile of the genes present in that population (functional profiling). This approach is ideal for novel discovery. Many new genes, phylotypes, regulators, and/or pathways are discovered using shotgun metagenome sequencing. Combining amplicon sequencing (lower cost, lower resolution) with whole-metagenome shotgun (WMS) sequencing (higher cost, higher resolution) enables richer experimental designs. Along this line, two-stage study designs are a good option: beginning by surveying a large number of individuals using amplicon sequencing and then following up with a subset of samples (i.e., representative individuals of the group) using WMS sequencing. Another option could be time-course studies that combine amplicon sequencing, to survey a large number of time points, with WMS sequencing, to analyze a subset of time points (such as the first and last) in greater detail.
WORKFLOW IN METAGENOMIC STUDIES
For amplicon sequencing and whole-metagenome sequencing, the workflow is similar (Song et al., 2018). Although these suggestions may be extended to other metagenome studies, we will focus on gut microbiome studies (Fig. 11.1):
1. Processing of biological samples. As in any other study, experimental design is critical to the results. Selecting the least invasive method of sample collection for the patient will help ensure the subject's cooperation during the whole study, especially if it is a longitudinal study. Fortunately, in gut microbiota studies, feces are a good representative sample of the colonic microbiota.
a. Sample collection: There are plenty of commercial products for collecting fecal samples. However, the critical step for gut
[Figure 11.1: flowchart. Hypothesis, sample, and sample processing lead to sequencing, which branches into targeted sequencing (16S rRNA; OTU clustering, taxonomic identification, functional and pathway inferences) and untargeted sequencing (whole shotgun sequencing; data assembly, gene prediction, taxonomic identification, functional and pathway annotation). Both branches feed multivariate analysis and combination with other omics data.]
FIGURE 11.1 Workflow for microbiome analysis. OTUs, operational taxonomic units; rRNA, ribosomal RNA.
microbiota analysis is the correct storage of samples. Thus, freezing the sample at -80 °C within 2 hours after sample collection is recommended. If this cannot be done, there are commercial collection kits on the market with stabilizers that permit storage even at room temperature.
b. DNA extraction: Variations in DNA extraction methods can have dramatic impacts on the results of metagenomic studies, especially in high-diversity communities such as those in feces. Although there is much controversy regarding the best way to extract DNA, most extraction methods use the same basic steps: cellular lysis, removal of non-DNA macromolecules, and collection of DNA. The researcher should choose the method that best fits the intended outcome.
2. Sequencing: After the successful extraction of DNA, the proper selection of the sequencing approach, targeted or nontargeted, and the type of sequencer will depend on the goal of the researcher. Surprisingly, although commonly used sequencers have different error rates and patterns, their effects on taxonomic and functional composition are minimal. Balancing sequencing depth with the number of samples per sequencing run depends on the biological question and the complexity of the community. The sampling effort generally depends on the variation among the microbial communities to be compared. For communities that share great similarity, deeper sequencing is needed to distinguish treatment effects on microbial communities.
3. Bioinformatics data analysis tools: The large amount of data produced in high-throughput analysis requires sophisticated analysis tools; direct manipulation of the data is no longer feasible. Standardization and access to metadata are imperative to reduce the effects of differing experimental protocols and data analysis methods. However, the magnitude of these effects is often small relative to biological variation.
a. For targeted amplicon sequences such as 16S rRNA, multiple user-friendly tools are available for researchers who are not expert bioinformaticians. Of those, Quantitative Insights Into Microbial Ecology (QIIME) and its newer release, QIIME2, as well as mothur are among the most used because of the good support they offer, with multiple tutorials and good, quick feedback from the developers. However, large laboratories need bioinformaticians who implement and automate their own pipelines.
b. For whole-metagenome studies, the pipeline is more complicated. Early attempts at metagenomic data assembly used tools initially implemented for single-genome assembly. However, assembly tools have evolved significantly since then, and multiple tools have been specifically designed to assemble samples containing multiple genomes. Gene prediction, taxonomic identification, functional and pathway annotation, and finally data analysis are carried out using different strategies (see Roumpeka et al. (2017) for a summary of available software; new programs are being developed every month).
One of the greatest challenges in metagenomics is identifying features that differ significantly between communities. The large amount of available information requires good knowledge of statistical tests. Moreover, it may be difficult to understand the biological importance of the results. Thus, multidisciplinary teams make it possible to address the biological hypothesis in a deeper and more reliable way.
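One reason statistical care is needed is that a metagenomic comparison tests many features (taxa or genes) at once, so some small p-values arise by chance. A common remedy, offered here as an illustration and not as the procedure used by the authors, is to control the false discovery rate with the Benjamini-Hochberg adjustment. The sketch below is a plain-Python version; the p-values are made up, and how they were obtained (which per-feature test) is left open.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values (q-values) in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices from smallest to largest p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up p-values for four features (e.g., four genera compared between two groups)
p = [0.001, 0.04, 0.03, 0.5]
q = benjamini_hochberg(p)
significant = [i for i, value in enumerate(q) if value < 0.05]
print(significant)  # [0]
```

In this toy case three features have raw p-values below 0.05, but only the first remains below 0.05 after adjustment, which shows how the number of reported differences can shrink once multiple testing is taken into account.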
OTHER MICROORGANISMS
Eukaryotes have important roles in almost all ecological niches on the earth; presumably, this is also true within the human body. However, the study of these organisms by NGS techniques has been challenging because they are not well described in sequence databases. This lack of reference eukaryotic genomes is due in part to the difficulty of their genome assembly and annotation. The low abundance of eukaryotes compared with bacteria and the need to avoid amplifying host DNA are further challenges. Nevertheless, targeted approaches have used the 18S rRNA gene in a manner similar to the 16S rRNA gene, and some of the software used for 16S rRNA gene analysis, such as QIIME, includes tools to analyze these amplicons. Viruses are also present in the human microbiome. A typical healthy human carries around 10^12 viral particles, consisting mainly of bacteriophages and, to a minor extent, eukaryotic viruses. As with bacteria, the virome of adults seems to be stable over time. However, the lack of a gene that is universally present in all viruses means that amplicon-based studies cannot be used to characterize the virome. Nonetheless, bacteriophages have critical roles in the microbiome ecosystem, and more efforts should be made to study them.
CURRENT GUT MICROBIOTA STATUS
The variability of the microbiota among human subjects is immense; big projects such as the Human Microbiome Project and MetaHIT, as well as smaller ones, have done much to define this variability. However, causality has been established in only a few cases. Germ-free mice, that is, mice without microorganisms living in them, represent a model system for studying the effect of gut microbes on host physiology. Experimental manipulations of murine models in gut microbiota research allow functional and mechanistic research into host-microbe interactions, which helps to assess causality in disease-associated alterations in gut microbiota
composition. Thus, this mouse model is increasingly being used to study the role and function of gut microbiota and its association with diseases. Humanized gnotobiotic mice, which result from the inoculation of a human gut microbiota sample into germ-free mice, provide a powerful tool for gut microbiota studies because these models can recapitulate a large part of the phylogenetic composition of the human gut microbiota (Lundberg et al., 2016). This approach has been widely employed because it allows perturbations in a human-like system and is considered to be the reference standard for confirming associations and trying to prove causality in gut microbiota research. However, the host-microbe relationships in these humanized models do not necessarily reflect the real ones seen in humans, because the gut microbiota is transplanted into a host with which it has not coevolved. Fecal microbiota transplants (FMTs) are the standard for revealing causality in humans. FMT is defined as the administration of a solution of fecal matter from a donor into the intestinal tract of a recipient to change the recipient's microbial composition directly and confer a health benefit (Cammarota et al., 2017). A classic example is the case of persistent Clostridium difficile infections, for which this treatment reaches an average cure rate of 90%. However, much attention is being paid to other conditions, such as inflammatory bowel disease and even metabolic diseases. The process is simple and usually involves selecting a donor without a family history of autoimmune, metabolic, or malignant diseases and screening for potential pathogens. The feces are then prepared by mixing with water, normal saline, or another buffer, followed by a filtration step to remove particulate matter. The mixture can be administered through the upper digestive tract (nasogastric tube, nasojejunal tube, or esophagogastroduodenoscopy) or the lower digestive tract (colonoscopy or retention enema).
Moreover, noninvasive approaches have been tested, such as using frozen or lyophilized pills. Although FMT is still considered an experimental treatment and its effectiveness must be confirmed for every disease, its low cost, low risk, and high effectiveness position it as a good therapy because of its effects on the gut microbiota.
PERSPECTIVES
As demonstrated earlier, DNA sequences are rich in information. However, although the genomic content of a community describes what the community is capable of doing, it does not provide a direct measure of the community's functional activity at a particular
time; most important, it is unable to discern between living, functionally active cells and dead cells. To understand fully the determinants of function, additional multiomic data types such as transcriptomics, proteomics, and metabolomics are needed. Each -omics technology has its strengths and weaknesses and must be selected based on the biological questions and objectives of the study. Metatranscriptomics informs us of the genes that are expressed by the community as a whole. Thus, metatranscriptomics sheds light on the active functional profile of a microbial community; in other words, the metatranscriptome provides a picture of gene expression in a given sample at a given moment and under specific conditions by capturing the total messenger RNA (mRNA). After the rRNA is removed, microbial mRNA is converted into complementary DNA (cDNA) and sequenced by standard methods. With appropriate barcoding of DNA and cDNA samples, metagenomic and metatranscriptomic sequencing can be carried out together. Combining metagenomics with metatranscriptomics can also reveal changes in functional activity in response to interventions, such as those with diets or drugs. Metaproteomics is the study of all proteins expressed by organisms within an ecosystem at a specific time. Measuring protein abundance provides a more direct measure of the functional activity of a cell or community. Protein abundance can be determined in a high-throughput manner using next-generation proteomics (metaproteomics in the microbiome context). Proteomic methods rely on mass spectrometry-based shotgun quantification of peptide mass and abundance. Metaproteome datasets reveal information about microbial community structure, dynamics, and functional activities that are important for a better understanding of various community aspects, such as microbial recruitment, how participating organisms cooperate and compete for nutrient resources, and how these organisms distribute metabolic activities across the community.
Metabolomics is another rapidly expanding field of gut microbiota research that evaluates metabolites and other small molecules associated with the interrelationship of host-bacterial metabolism that has implications for health and disease. Metabolomics relies on chromatography techniques (such as high-performance liquid chromatography) to separate compounds, which are then identified and quantified using mass spectrometry. Short-chain fatty acids provide a good example of the importance of metabolites in microbiota-host interactions. Indeed, the use of multiple -omic data types better captures the functional activity of a community. The
addition and combination of multiple forms of -omic data (transcriptomics, proteomics, and/or metabolomics) with metagenomics in an integrated framework will improve the mechanistic models of the microbial community structure and function. Future metagenomics studies should go in this direction.
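As a small illustration of what combining metagenomic and metatranscriptomic data can look like in practice, the sketch below compares, gene by gene, the share of RNA reads with the share of DNA reads. This is one simple normalization chosen for illustration, not a method prescribed in this chapter; the gene names, read counts, pseudocount, and function names are all hypothetical.

```python
import math


def relative_abundance(counts):
    """Convert raw read counts per gene into fractions of the total."""
    total = sum(counts.values())
    return {gene: n / total for gene, n in counts.items()}


def expression_ratio(dna_counts, rna_counts, pseudocount=1e-6):
    """log2 of RNA relative abundance over DNA relative abundance, per gene.

    Positive values suggest a gene is transcribed more than its genomic
    abundance alone would predict; negative values suggest the opposite.
    The pseudocount avoids division by zero and log of zero.
    """
    dna = relative_abundance(dna_counts)
    rna = relative_abundance(rna_counts)
    return {
        gene: math.log2((rna.get(gene, 0.0) + pseudocount) / (dna[gene] + pseudocount))
        for gene in dna
    }


# Hypothetical counts: both genes are equally abundant in the metagenome,
# but geneA dominates the metatranscriptome.
dna_counts = {"geneA": 500, "geneB": 500}
rna_counts = {"geneA": 900, "geneB": 100}
ratios = expression_ratio(dna_counts, rna_counts)
print({gene: round(value, 2) for gene, value in ratios.items()})
```

Computing such ratios before and after an intervention (for example, a dietary change) is one way to separate a shift in community composition from a shift in what the community is actively doing.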
CONCLUSIONS AND REMARKS
Interest in the human microbiome has increased considerably. A significant driver has been the confirmation of some of the important roles of the commensal microorganisms that comprise the human microbiota, which are not simply passengers in the host. However, the real driver has been the availability of methodologies, mainly NGS methods, which have made metagenomic analysis more affordable for small to medium laboratories. Compared with targeted 16S-based metagenomic sequencing, however, WGS generates far more sequences, which require large amounts of storage, and yields large numbers of unknown species, which demand more computational resources. In addition, metagenomic sequencing faces a crucial limitation: it is unable to distinguish between living, functional bacteria and dead bacteria. Longitudinal analysis could be advantageous for studying perturbations of microbial communities. Moreover, because microbiome science is growing rapidly, extending studies to other members such as viruses, bacteriophages, and fungi is desirable. In addition, an integrated framework of multiomic data, with data on the levels of community RNA (transcriptomics), proteins (proteomics), and metabolites (metabolomics), could be (and should be) the future of microbiome analysis and the best approach to describing a microbial community fully. Rapid and continuous advances in molecular high-throughput technologies and high-performance computational tools ensure that researchers will be able to model and predict the behaviors of microbial communities in the near future (Nayfach and Pollard, 2016). Nevertheless, massive amounts of novel data on host-microbe interactions are coming to light. This complexity of metagenomic analyses is a fundamental limitation for explaining the roles of the microbiome within the host. Despite this, numerous health outcomes have been explained, and these findings have been used for clinical diagnosis and treatment.
Although there is enthusiasm about the opportunities that gut microbiota presents for new diagnostics, prognostics, and therapeutics, more studies need to be carried out.
SUPPLEMENTARY DATA
Supplementary data related to this article can be found online at https://doi.org/10.1016/B978-0-12-804572-5.00011-2
Acknowledgments
This work was supported by a grant from the Fondo de Investigación Sanitaria of Instituto de Salud Carlos III and co-funded by Fondo Europeo de Desarrollo Regional-FEDER (PI15/01114). Isabel Moreno-Indias acknowledges support from the "Miguel Servet Type I" Program (CP16/00163).
References
Cammarota, G., Ianiro, G., Tilg, H., Rajilic-Stojanovic, M., Kump, P., Satokari, R., Sokol, H., Arkkila, P., Pintus, C., Hart, A., Segal, J., Aloi, M., Masucci, L., Molinaro, A., Scaldaferri, F., Gasbarrini, G., Lopez-Sanroman, A., Link, A., de Groot, P., de Vos, W.M., Högenauer, C., Malfertheiner, P., Mattila, E., Milosavljevic, T., Nieuwdorp, M., Sanguinetti, M., Simren, M., Gasbarrini, A., 2017. European consensus conference on faecal microbiota transplantation in clinical practice. Gut 66 (4), 569-580. https://doi.org/10.1136/gutjnl-2016-313017.
Kozińska, A., Seweryn, P., Sitkiewicz, I., 2019. A crash course in sequencing for a microbiologist. J Appl Genet 60 (1), 103-111. https://doi.org/10.1007/s13353-019-00482-2.
Lundberg, R., Toft, M.F., August, B., Hansen, A.K., Hansen, C.H.F., 2016. Antibiotic-treated versus germ-free rodents for microbiota transplantation studies. Gut Microb 7, 68-74.
Nayfach, S., Pollard, K.S., 2016. Toward accurate and quantitative comparative metagenomics. Cell 166, 1103-1116.
Roumpeka, D.D., Wallace, R.J., Escalettes, F., Fotheringham, I., Watson, M., 2017. A review of bioinformatics tools for bio-prospecting from metagenomic sequence data. Front Genet 8, 23.
Song, E.J., Lee, E.S., Nam, Y.D., 2018. Progress of analytical tools and techniques for human gut microbiome research. J Microbiol 56, 693-705.
Young, V.B., 2017. The role of the microbiome in human health and disease: an introduction for clinicians. BMJ 356, j831.