Bioinformatics and genomic databases

Handbook of Clinical Neurology, Vol. 147 (3rd series) Neurogenetics, Part I D.H. Geschwind, H.L. Paulson, and C. Klein, Editors https://doi.org/10.1016/B978-0-444-63233-3.00007-5 Copyright © 2018 Elsevier B.V. All rights reserved

Chapter 7

Bioinformatics and genomic databases

JASON CHEN AND GIOVANNI COPPOLA*

Interdepartmental Program in Bioinformatics and Semel Institute for Neuroscience and Human Behavior, University of California, Los Angeles, CA, United States

Abstract

High-throughput, low-cost sequencing technologies have begun to yield new insights into biology and medicine. New data enable the interrogation of the molecular biology of disease from DNA to RNA to protein, charting the central dogma. This chapter reviews some of the key advances and resources in the application of bioinformatics to understanding, and ultimately diagnosing and treating, diseases of the nervous system. Array genotyping, exome sequencing, and whole-genome sequencing, in both disease and healthy populations, have enabled the interpretation of new genetic data. Profiling of epigenetic markers, such as histone modifications, has added to our understanding of the regulatory machinery of the genome. Downstream, mRNA and protein expression data from published experiments and high-throughput studies enable complex analyses of gene function across many experimental conditions and tissues. Further delineation of molecular mechanisms arises from the concept of genes working together in pathways or networks, reflecting direct protein interactions and regulatory relationships. The rapidly moving field of bioinformatics has made significant contributions to neurology in these early days; continued advances promise to transform medicine from basic science to clinical practice, as more genomics data are generated, combined, and analyzed in the future.

INTRODUCTION

It is remarkable to imagine that consciousness, cognition, and other neurologic functions are ultimately encoded in a simple nucleic acid polymer, DNA. The central dogma of molecular biology describes the expression of genetic information encoded by DNA in the form of RNA intermediates, which themselves serve as templates for proteins that maintain cellular structure, catalyze biochemical reactions, and perform other crucial cellular functions (Fig. 7.1). In the six decades since James Watson and Francis Crick first described the double-helix structure of DNA, our understanding of the molecular biology of the cell has rapidly progressed and the central dogma has evolved. Slightly more than a decade after the Human Genome Project, genome sequencing at an individual level is becoming routine as costs rapidly decrease.

An array of high-throughput technologies for measuring RNA intermediates and epigenetic markers (such as DNA methylation and histone modifications) is widely available and has been applied to a large variety of cell types. The prodigious scale of these genetic and genomic data requires storage in specialized databases, and analysis using the methods of bioinformatics and computational biology. This chapter surveys a number of genomic databases at each of the levels outlined in the central dogma, and describes their application to basic science and clinical practice. Descriptions of each database, as well as potential applications and pitfalls, are summarized in Table 7.1. Current uses are centered in research but over time have begun to translate into clinical applications, such as interpretation of genetic testing and pharmacogenomics. As more genomics data become available and connected to specific pathologies, advances in bioinformatics promise the inevitable transformation of medical practice.

Fig. 7.1. An expanded “central dogma” model describes the flow of genetic information in biologic organisms and outlines various types of databases. Protein-coding and regulatory DNA sequences, as well as epigenetic modifications, control the nature and quantity of transcribed mRNA. Proteins are translated from the mRNA, and in combination with other proteins (directly or indirectly interacting), participate in biologic processes that sustain life. Genomics databases house experimental data from each of the described phases.

*Correspondence to: Giovanni Coppola, MD, 3506 Gonda, 695 Charles Young Drive South, Los Angeles CA 90095, United States. Tel: +1-310-794-4172, E-mail: [email protected]

DNA

A draft sequence of the entire human genome was determined by the Human Genome Project and published in 2001 (International Human Genome Sequencing Consortium et al., 2001). This original genome sequence was created from eight bacterial artificial chromosome libraries (publicly funded project) and an additional 16 libraries from five donors (Celera project), and it became clear that this diversity was insufficient to describe genetic variation across and within populations. For example, large numbers of genotypes were needed to define the linkage disequilibrium patterns that are exploited by genetics studies in order to yield essentially whole-genome coverage by typing a relatively small number of “tagging” polymorphisms. Another limitation was that the allele frequencies of rare variants could only be estimated from large sample sizes. Further compounding the need for increased genetic data was the fact that linkage disequilibrium patterns and variant allele frequencies differ between ethnicities, thereby requiring reference samples from individuals from many different populations (The International HapMap Consortium, 2003).
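The dependence of rare-variant discovery on sample size can be illustrated with a short calculation. The sketch below is only an idealized model: it assumes diploid individuals sampled independently from a single population, and the function name `prob_variant_observed` is hypothetical. Under those assumptions, an allele of population frequency p is seen at least once among n individuals with probability 1 - (1 - p)^(2n).

```python
def prob_variant_observed(freq, n_individuals):
    """Probability that an allele of population frequency `freq` appears
    at least once among `n_individuals` diploid samples.

    Assumes independent sampling of 2 * n_individuals chromosomes from
    one population (an idealization; real cohorts are structured).
    """
    if not 0.0 <= freq <= 1.0:
        raise ValueError("freq must lie between 0 and 1")
    if n_individuals < 0:
        raise ValueError("n_individuals must be non-negative")
    n_chromosomes = 2 * n_individuals
    return 1.0 - (1.0 - freq) ** n_chromosomes


if __name__ == "__main__":
    # Illustrative values only: a variant with frequency 0.1%.
    for n in (100, 2500):
        print(n, round(prob_variant_observed(0.001, n), 3))
```

With these illustrative inputs, a variant at 0.1% frequency has roughly an 18% chance of being seen at all in 100 individuals, but more than a 99% chance in 2500, which is one way to see why reference panels needed to grow.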

Genotyping arrays

The microarray allowed for rapid genotyping of hundreds of thousands, if not millions, of single-base variants throughout the human genome. Major efforts in genotyping, such as the HapMap Project (The International HapMap Consortium, 2003), have laid the groundwork for much work in population genetics, and with these datasets, the allele frequencies of common polymorphisms can now be estimated. Novembre et al. (2008) strikingly demonstrated the rich information available from such genotyping cohorts; using data from the multinational European Population Reference Sample study, they showed that geographic ancestry could be predicted with high spatial resolution. Microarray analyses of the associations between common polymorphisms and disease, known as genomewide association studies (GWAS), have used knowledge of the linkage disequilibrium patterns determined from large-scale studies to detect signals across the entire genome in a cost-effective manner. GWAS have found a huge number of variants associated with disease, identifying more than 10,000 single-nucleotide polymorphism (SNP)-trait associations in the past decade (Welter et al., 2014). The linkage disequilibrium patterns that can be calculated from large-scale genotyping datasets such as HapMap also allow for prediction of untyped variants based on calls from genotyped variants, a process known as imputation (Howie et al., 2011; The 1000 Genomes Project Consortium, 2012).
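Linkage disequilibrium between two biallelic variants is commonly summarized by the squared correlation r². The following minimal sketch computes r² from phased haplotypes; it assumes that phase is already known and that alleles are coded 0/1, and the function name `ld_r_squared` and the toy haplotypes are hypothetical. It is not the procedure used by HapMap or by any imputation software, which also handle unphased genotypes and missing data.

```python
def ld_r_squared(haplotypes):
    """Squared correlation (r^2) between two biallelic sites.

    `haplotypes` is a list of (allele_a, allele_b) pairs, one pair per
    phased chromosome, with alleles coded as 0 or 1.
    """
    n = len(haplotypes)
    if n == 0:
        raise ValueError("at least one haplotype is required")
    p_a = sum(a for a, _ in haplotypes) / n          # frequency of allele 1 at site A
    p_b = sum(b for _, b in haplotypes) / n          # frequency of allele 1 at site B
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                              # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined when a site is monomorphic")
    return d * d / denom


if __name__ == "__main__":
    # Toy data (hypothetical): eight phased chromosomes.
    toy = [(1, 1)] * 3 + [(0, 0)] * 4 + [(1, 0)]
    print(round(ld_r_squared(toy), 3))
```

A value of r² near 1 means that typing one variant nearly determines the other, which is the property exploited by tagging polymorphisms and by imputation.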

Table 7.1 Overview of genomic databases reviewed in this article

Name | Contents | Pitfalls | URL | Reference

DNA
HapMap | Genomewide genotyping from 1301 individuals from 11 populations (phase III) | Specialized genotyping platform and imputation methods complicates comparison with other genotyping arrays | http://hapmap.ncbi.nlm.nih.gov/ | The International HapMap Consortium (2003)
1000 Genomes Project | Whole-genome sequencing data for more than 2500 individuals from 26 populations (phase III) | Relatively low coverage sequencing, e.g., may undercall rare variants | http://www.1000genomes.org/ | The 1000 Genomes Project Consortium (2012)
Exome Variant Server (EVS) | Exome sequencing data for 6503 individuals, primarily European-Americans and African-Americans (ESP6500) | Lack of individual-level data; some of the sequenced subjects have heart, lung, and blood disorders | http://evs.gs.washington.edu/EVS/ | Fu et al. (2013)
Exome Aggregation Consortium (ExAC) | Exome sequencing data for 60,706 individuals covering 10,195,872 variants (version 0.3) | Lack of individual-level data; some of the sequenced subjects are from disease cohorts | http://exac.broadinstitute.org/ | Lek et al. (2016)
Database of Genotypes and Phenotypes (dbGAP) | Genetic and phenotypic data from 529 genotyping and sequencing studies (Jan 2015) | Notoriously difficult to request access to individual-level data | http://www.ncbi.nlm.nih.gov/gap | Tryka et al. (2014)
The Human Gene Mutation Database (HGMD) | Curated, published disease-causing and disease-associated variants, totaling 163,610 variant entries (release 2014.4) | Many reported disease associations may be benign polymorphisms | http://www.hgmd.cf.ac.uk/ | Stenson et al. (2014)
ClinVar | Submission-based associations between variants and disease, totaling 111,515 variant entries (March 2015) | Not as comprehensive as HGMD commercial version yet; many reported disease associations may be benign | http://www.ncbi.nlm.nih.gov/clinvar/ | Landrum et al. (2014)
NHGRI GWAS Catalog | Published genomewide association studies, covering 2102 publications and 15,270 SNP associations (January 2015) | Fields for reported study entries (e.g., effect size, risk allele) are sometimes incomplete | http://www.genome.gov/gwastudies/ | Welter et al. (2014)
Database for Nonsynonymous SNPs’ Functional Predictions (dbNSFP) | Variant-based and gene-based functional effect annotations of 87,361,054 nonsynonymous SNPs (v 2.9) | Variant effect prediction scores are generally poor classifiers | https://sites.google.com/site/jpopgen/dbNSFP | Liu et al. (2013)

Epigenetics and regulation
TRANSFAC | Curated transcription factor-binding models and other information about 21,215 transcription factor conditions (version 7.0) | Multiple (redundant) entries per transcription factor may complicate analysis; full version is commercially licensed | http://www.gene-regulation.com/pub/databases.html | Matys et al. (2006)
JASPAR | Curated nonredundant transcription factor-binding models derived from SELEX, ChIP-chip, and ChIP-seq experiments | Nonredundant transcription factor-binding information may not be as accurate for certain experimental conditions | http://jaspar.genereg.net/ | Mathelier et al. (2014)
Homo Sapiens Comprehensive Model Collection (HOCOMOCO) | Curated transcription factor-binding models derived primarily from ChIP-seq experiments for 401 human transcription factors (January 2015) | Limited to humans; only a subset of transcription factors included | http://autosome.ru/HOCOMOCO/ | Kulakovskiy et al. (2013)
Encyclopedia of DNA Elements (ENCODE) | Results of 4626 genomics experiments, e.g., RNA sequencing, DNAseI hypersensitivity, transcription factor-binding sites, histone modifications (Jan 2015) | Most of the data is generated from cell lines which may not reflect the biology of normal tissue | https://www.encodeproject.org/ | The ENCODE Project Consortium (2012)
Roadmap Epigenome Project | Results of genomics experiments, e.g., bisulfite sequencing, DNAseI hypersensitivity, RNA sequencing, histone modifications, in 261 cell lines and ex vivo tissue samples | Not all experiments performed in each tissue | http://www.roadmapepigenomics.org/ | Bernstein et al. (2010)

RNA and proteins
NCBI Gene Expression Omnibus (GEO) | Microarray dataset repository for 54,684 GEO Series (January 2015) | Poorly annotated metadata in some cases | http://www.ncbi.nlm.nih.gov/geo/ | Barrett et al. (2013)
EMBL-EBI ArrayExpress | Microarray dataset repository for 55,683 experiments (January 2015) | Poorly annotated metadata in some cases; some redundancy with GEO | https://www.ebi.ac.uk/arrayexpress/ | Kolesnikov et al. (2015)
Sequence Read Archive (SRA) | Raw sequencing data and alignment information from high-throughput sequencing | Data stored in non-standard file types | http://www.ncbi.nlm.nih.gov/sra | Kodama et al. (2012)
EMBL-EBI Expression Atlas | Gene expression summaries from 909 experiments, including the Illumina Body Map and ENCODE Cell Lines (January 2015) | - | http://www.ebi.ac.uk/gxa/home | Petryszak et al. (2014)
Allen Brain Atlas | Extensive datasets, including gene expression data in mouse and human brain at various stages of development, including in situ hybridization, microarray, and RNA-sequencing | - | http://www.brain-map.org/ | Hawrylycz et al. (2014)
The Human Protein Atlas | Immunohistochemistry (based on 24,028 antibodies against 16,975 proteins) and transcriptomics analysis across 213 tissue and cell line samples | Antibodies may not be well validated | http://www.proteinatlas.org/ | Uhlen et al. (2015)
Online Mendelian Inheritance in Man (OMIM) | Literature summaries of human genes and genetic diseases | - | http://www.omim.org/ | Amberger et al. (2015)
UniProt | Basic information about protein sequence, protein structure, posttranslational modifications, and literature summary | - | http://www.uniprot.org/ | The UniProt Consortium (2015)
Gene Ontology (GO) | Standardized, computer-readable gene annotations in a relational database | Gene Ontology enrichments often due to incorrect background model instead of biologic signal | http://geneontology.org/ | The Gene Ontology Consortium (2015)

Pathways and gene networks
Kyoto Encyclopedia of Genes and Genomes (KEGG) | Pathway maps for 468 curated biologic pathways, and other genomic annotations (January 2015) | Sometimes difficult to find primary evidence supporting a KEGG pathway | http://www.genome.jp/kegg/ | Kanehisa and Goto (2000)
Reactome | Pathway maps for 1669 curated biologic pathways containing 7757 proteins with cross-references to other databases (v51) | - | http://www.reactome.org/ | Croft et al. (2014)
PubMatrix | PubMed queries of pairwise combinations of terms from two lists | Returns publications with co-occurrences of search terms, regardless of their relation to each other | https://pubmatrix.irp.nia.nih.gov/ | Becker et al. (2003)
Information Hyperlinked over Proteins (iHOP) | Web tool for manually constructing gene networks | Time consuming to construct large networks, and text summaries may miss important information | http://www.ihop-net.org/ | Hoffman and Valencia (2004)
Chilibot | Automated PubMed-based gene network construction from a gene list | Natural language-processing algorithms occasionally misinterpret text from PubMed publications | http://www.chilibot.net/ | Chen and Sharp (2004)
IntAct | Protein-protein interaction database containing 301,584 human interactions, as well as data from other species, from 13,297 publications (Jan 2015) | Bias toward well-studied proteins; may have a high false-positive rate | http://www.ebi.ac.uk/intact/ | Kerrien et al. (2011)
Biological General Repository for Interaction Datasets (BioGRID) | Curated protein-protein interactions from 44,741 low-throughput and high-throughput studies (version 3.2.121) | Comprehensive for yeast data, which does not always translate into humans; may have a high false-positive rate | http://thebiogrid.org/ | Chatr-aryamontri et al. (2015)
Search Tool for the Retrieval of Interacting Genes/Proteins (STRING) | Comprehensive collection of data from other protein interaction databases, computational prediction algorithms, interaction transfer from model organisms, automated text mining, and other sources, for 5,214,234 proteins from 1,133 organisms (Version 9.1) | Contains many low-quality associations for which the evidence of interaction is weak | http://string-db.org/ | Szklarczyk et al. (2015)
COEXPRESdb | Gene coexpression database derived from GEO data | - | http://coxpresdb.jp/ | Okamura et al. (2015)

Database collections
National Center for Biotechnology Information (NCBI) | Links to NCBI resources, such as GEO and ClinVar | - | http://www.ncbi.nlm.nih.gov/guide/sitemap/ | NCBI Resource Coordinators (2015)
Ensembl | Links to Ensembl resources, such as BioMart and Variant Effect Predictor | - | http://www.ensembl.org | Cunningham et al. (2015)
Nucleic Acids Research Molecular Biology Database Collection | Listing and links to more than 1500 databases (January 2015) | - | http://www.oxfordjournals.org/our_journals/nar/database/c/ | Galperin et al. (2015)
GeneCards | Summary of a gene’s nomenclature, function, mRNA expression, literature, and others, with links to vendors and outside databases | - | http://www.genecards.org/ | Rebhan et al. (1998)
BioGPS | Summary of gene expression in different tissues, and links to outside resources such as Gene Ontology | - | http://biogps.org/ | Wu et al. (2009)
UCSC Genome Browser | Genomic annotations overlaid on chromosome physical coordinates for a variety of species | Difficult to visualize 3-D data, such as from chromosomal conformation capture experiments; mitochondrial reference is sometimes out of date | http://genome.ucsc.edu/ | Kent et al. (2002)
WashU Epigenetics Genome Browser | Publicly accessible annotations (e.g., ENCODE, Roadmap Epigenomics Project) and data analysis/visualization tools | - | http://epigenomegateway.wustl.edu/ | Zhou et al. (2013)

EMBL-EBI, European Molecular Biology Laboratory-European Bioinformatics Institute; GWAS, genomewide association studies; NCBI, National Center for Biotechnology Information; NHGRI, National Human Genome Research Institute; SNP, single-nucleotide polymorphism; UCSC, University of California, Santa Cruz; WashU, Washington University.

Whole-genome and whole-exome sequencing

Since the Human Genome Project, advances in genome-sequencing technology have drastically decreased costs, making it possible to sequence the genomes of large cohorts, analogous to the HapMap project. For example, The 1000 Genomes Project, as of 2015, consists of low-coverage (4×) whole-genome sequencing of more than 2500 individuals. Single-nucleotide variants were identified, and additional variants were “filled in” using imputation (that is, inferring the presence of an unobserved variant from the observation of nearby polymorphisms that are in linkage disequilibrium) (The 1000 Genomes Project Consortium, 2012). Other types of variants, including small indels and structural variants, were also called. This dataset provides a publicly accessible reference for many aspects of genetics and genomics studies. For example, calls from The 1000 Genomes Project Consortium have been used for imputation in other datasets (Huang et al., 2012); identifying rare variants for prioritization in studies of Mendelian disease (Ng et al., 2010); and calculating how genetic variants affect gene expression (Lappalainen et al., 2013).

Compared to genome sequencing, exome sequencing is currently the more cost-effective approach, and publicly available exome-sequencing datasets describe large sample sizes at high coverage. The exons of genes (collectively known as the exome), together comprising about 2–3% of the human genome sequence, are enriched for functional variants whose effects on polypeptide sequence are easily predicted. Indeed, the majority of defined genetic causes of Mendelian diseases fall within the exome, likely because these variants are both more likely to be deleterious and have been relatively easy to analyze. Targeted sequencing of the exome is accomplished by enrichment of the genomic regions of interest by probe hybridization among the pool of DNA fragments, thereby reducing the amount of sequencing (and the cost) required per individual. The Exome Sequencing Project exome variant server currently contains variants from high-coverage exome sequencing of more than 6500 individuals (Fu et al., 2013). Recently, the Exome Aggregation Consortium released data from exome sequencing of 60,706 individuals (Lek et al., 2016). Recent work has continued to expand these datasets to include more subjects and cover more of the genome with whole-genome sequencing. With higher depth of coverage, these resources provide a more comprehensive picture of genetic variation in coding regions, particularly for low-frequency variants, and have been used for comparison in many deep exome sequencing experiments (Rauch et al., 2012; Lim et al., 2013).

Disease genetics

Raw data from GWAS, linking a patient’s genotype information with his or her diagnosis, can be found in databases such as the National Center for Biotechnology Information (NCBI) database of Genotypes and Phenotypes (dbGAP) (Tryka et al., 2014) and the National Institute on Aging Genetics of Alzheimer’s Disease Data Storage Site (NIAGADS) (Wang et al., 2014). However, concerns about privacy (Lowrance and Collins, 2007) limit much of the publicly available data to summary statistics; access to individual-level data usually requires submitting an application and documenting Institutional Review Board approval. Despite these controls, genetic data from previous studies have been exploited in a number of diverse study designs. For example, meta-analyses for diseases such as Parkinson’s disease (International Parkinson Disease Genomics Consortium, 2011) have revealed novel associations by combining data from previous GWAS in order to increase statistical power. Novel methods for analyses such as meta-analysis and genetic correlation that can be applied to summary statistics have further expanded the utility of data from these genetic studies.

At a higher level, results from disease–genotype associations and known genetic causes of disease have been catalogued in a variety of databases. One listing of disease-associated variants, the Human Gene Mutation Database (HGMD) (Stenson et al., 2014), contains a curated list of variants from the scientific literature and locus-specific mutation databases. Aside from containing data regarding the variant, disease, publication, and associated metadata, HGMD also reports the confidence of association. Mutation databases have often been used in clinical sequencing pipelines to diagnose genetic disease (Yang et al., 2013).
However, although the commercial version of HGMD is quite comprehensive, even the variants with highest confidence are sometimes found to be prevalent in the general population, suggesting that they may not be pathogenic (Andreasen et al., 2013), and results must be interpreted with caution. A freely accessible public database, ClinVar, also contains disease-variant associations (Landrum et al., 2014). As in HGMD, ClinVar entries are linked to journal articles or other supporting evidence, and report the type and confidence of the association. Often, catalogs of pathogenic variants focus on particular diseases, such as Alzheimer disease and frontotemporal dementia (Cruts et al., 2012) or autism spectrum disorder (Basu et al., 2009). However, even the largest such databases may have too little phenotype data to determine the pathogenicity of each variant with certainty, and reclassification often occurs (Cassa et al., 2013). While HGMD focuses on variants identified for many diseases using diverse study designs, the National Human Genome Research Institute GWAS catalog (Welter et al., 2014) contains genomewide significant polymorphisms reported in GWAS publications. Unlike HGMD and databases of reported pathogenic variants, GWAS data for most diseases (with some exceptions) cannot accurately predict disease risk, due to a low fraction of heritability explained.

In aggregate, these data have been used to characterize disease-causing variants in general. A widely used naïve Bayes classifier, PolyPhen2, was trained on mutations known to cause Mendelian disease (Adzhubei et al., 2010). PolyPhen2, and other similar scores, can roughly predict the effect of a variant based on measures such as evolutionary conservation and genomic context; a number of such annotations are collected in the Database for Nonsynonymous SNPs’ Functional Predictions (dbNSFP) database (Liu et al., 2013). Recently, many groups have demonstrated enrichment of SNPs with known genomewide associations in regions marked by quantitative trait loci (Nicolae et al., 2010), DNase I hypersensitivity sites (Maurano et al., 2012), and other markers, suggesting ways to prioritize variants of interest in genetic studies.
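The gain in statistical power from combining GWAS summary statistics, mentioned earlier in this section, can be illustrated with a standard inverse-variance-weighted fixed-effect meta-analysis. The sketch below is a generic textbook formulation rather than the method of any study cited here; it assumes that each study reports an effect estimate (for example, a log odds ratio) for the same allele together with its standard error, and the function name and the numbers in the usage example are hypothetical.

```python
import math


def fixed_effect_meta(betas, standard_errors):
    """Inverse-variance-weighted fixed-effect meta-analysis.

    Returns (pooled_beta, pooled_se, z, two_sided_p). Assumes independent
    studies that estimate the same effect for the same allele.
    """
    if not betas or len(betas) != len(standard_errors):
        raise ValueError("need one standard error per effect estimate")
    weights = [1.0 / (se * se) for se in standard_errors]
    total_weight = sum(weights)
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    z = pooled_beta / pooled_se
    # Two-sided p-value under a standard normal null distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return pooled_beta, pooled_se, z, p_value


if __name__ == "__main__":
    # Hypothetical per-study log odds ratios and standard errors.
    beta, se, z, p = fixed_effect_meta([0.10, 0.14, 0.08], [0.05, 0.07, 0.04])
    print(f"beta={beta:.3f} se={se:.3f} z={z:.2f} p={p:.2g}")
```

Because the pooled standard error is smaller than that of any single study, a consistent effect that is not significant in individual cohorts can reach significance in the combined analysis.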

EPIGENETIC MARKERS AND REGULATORY FEATURES

The regulation of cellular, temporal, and spatial specificity in gene expression is a fundamental question of cell biology. While diverse cell types in an individual share similar DNA content, it is the regulation of how the encoded genes are expressed that gives rise to specific characteristics of the various cell lineages. Given the importance of regulation of gene expression, it is unsurprising that a large portion of the genome encodes regulatory elements and structural features relative to protein-coding sequences. Transcription of mRNA can be regulated by the binding of transcription factors and splicing factors to canonic DNA sequences; covalent modification of histones, such as by acetylation and methylation; and modification of DNA, such as by methylation and hydroxymethylation. As an example of epigenetic markers in clinical use, DNA methylation of the MGMT promoter region predicts prognosis and the outcome of treatment of malignant glioma with alkylating agents (Hegi et al., 2008) and has become a routine part of management. The insights gleaned from increased understanding of gene expression regulation and epigenetics may translate into further clinical utility.

Transcription factors

An estimated 5–10% of known genes code for proteins that bind to DNA and regulate transcription, called transcription factors (Messina et al., 2004). The central role of transcription factors in gene regulation has been illustrated by the striking discovery of induced pluripotent stem cells. Takahashi and Yamanaka (2006) discovered that expressing just four transcription factors in differentiated cells – Oct4, Sox2, c-Myc, and Klf4 – could revert the cell into a pluripotent state. Neurologic diseases have been attributed to disruption of certain transcription factors, such as the childhood apraxia of speech caused by loss of the transcription factor FOXP2 (Lai et al., 2001). Transcription factors typically bind to DNA within promoter or enhancer regions of target genes by recognizing a specific sequence motif at the transcription factor-binding site (TFBS). There, they may interact with components of the transcriptional machinery or activators or repressors. A basic model of this regulatory interaction thus requires an understanding of the specific transcription factor; where the transcription factor binds; and how that binding affects transcription of mRNA. A TFBS typically follows a certain pattern of nucleotides, called a motif. Methods such as systematic evolution of ligands by exponential enrichment (SELEX) are designed to identify the specific binding motif, which can be represented in a probabilistic fashion as a position weight matrix. TFBS can also be identified by high-throughput chromatin immunoprecipitation (ChIP) experiments, including ChIP followed by microarray (ChIP-chip), or ChIP followed by next-generation sequencing (ChIP-seq). TFBS motifs derived from these types of data are comprehensively catalogued in the TRANSFAC database, which provides position weight matrices from experiments of varying quality and across different biologic conditions (Matys et al., 2006).
However, because multiple experiments (and hence, multiple position weight matrices) may be associated with a single transcription factor, analyses using TRANSFAC may yield divergent results depending on which model is chosen. JASPAR is another comprehensive transcription factor database developed from curated SELEX, ChIP-chip, and ChIP-seq experiments (Mathelier et al., 2014). JASPAR addresses the redundancy issue by expert curation to determine the most reliable position weight matrix for each transcription factor, rather than maintaining multiple entries corresponding to individual experiments. Furthermore, the JASPAR database is freely accessible, unlike TRANSFAC. The recently developed Homo Sapiens Comprehensive Model Collection (HOCOMOCO) is another nonredundant, freely available TFBS database, though it is mostly based on high-throughput ChIP-seq data (Kulakovskiy et al., 2013).

Transcription factor databases can also help to identify key regulators of biologic processes. Genes participating in common pathways are often co-regulated, and therefore may be enriched in specific TFBS. The F-Match program can identify over- or underrepresented TRANSFAC TFBS in a set of sequences (Kel et al., 2006). The ChIP Enrichment Analysis (ChEA) tests for enrichment of transcription factor targets within a gene list based on a database of published ChIP-chip, ChIP-seq, ChIP-PET, and DamID experiments (Lachmann et al., 2010). In a similar approach expanded to include transcription factor motif databases, we have created a web application to identify enrichment of certain motifs (such as TFBS) in regulatory regions of a gene list, utilizing data from TFBS databases (including JASPAR, TRANSFAC, and HOCOMOCO) and other resources, available at https://tfenrichment.semel.ucla.edu/.
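The position weight matrix representation described above can be made concrete with a small example. The following sketch builds a log-odds matrix from per-position base counts and scans a sequence for the best-scoring window. It assumes a uniform background of 0.25 per base, a pseudocount of 1, and an uppercase sequence containing only A, C, G, and T; the counts are invented for illustration and do not come from TRANSFAC, JASPAR, or HOCOMOCO, and the function names are hypothetical.

```python
import math

BASES = "ACGT"


def pwm_from_counts(counts, pseudocount=1.0, background=0.25):
    """Convert per-position base counts into a log2-odds weight matrix."""
    pwm = []
    for column in counts:
        total = sum(column.get(b, 0) for b in BASES) + 4 * pseudocount
        pwm.append({
            b: math.log2(((column.get(b, 0) + pseudocount) / total) / background)
            for b in BASES
        })
    return pwm


def score_site(pwm, site):
    """Sum the log-odds weights of a site with the same length as the matrix."""
    return sum(column[base] for column, base in zip(pwm, site))


def best_hit(pwm, sequence):
    """Return (score, start) of the highest-scoring window on the given strand."""
    width = len(pwm)
    if len(sequence) < width:
        raise ValueError("sequence is shorter than the motif")
    return max(
        (score_site(pwm, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )


# Hypothetical counts for a 4-bp motif with consensus GATA (illustration only).
TOY_COUNTS = [
    {"G": 9, "A": 1},
    {"A": 10},
    {"T": 9, "C": 1},
    {"A": 8, "G": 2},
]

if __name__ == "__main__":
    toy_pwm = pwm_from_counts(TOY_COUNTS)
    score, start = best_hit(toy_pwm, "CCGATATT")
    print(f"best window starts at {start} with score {score:.2f}")
```

Scanning only one strand and taking the single best window are simplifications. Enrichment tools typically scan both strands, apply score thresholds, and compare hit counts against a background set of sequences.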

Histone modifications

DNA is organized and compacted by protein octamers (known as histones) into chromatin. The structure of chromatin plays a role in the function of the DNA; for example, the tight packing of heterochromatin silences transcription whereas the looser packing of euchromatin enables transcription (Bannister and Kouzarides, 2011). Posttranslational modifications of histones at specific residues, such as acetylation, phosphorylation, methylation, ubiquitinylation, and ADP-ribosylation, control histone function by a “histone code” and enable the histone to in turn regulate genome function. For example, heterochromatin is associated with dimethylation (me2) and trimethylation (me3) of the ninth lysine (K9) residue of histone H3 (written in shorthand as H3K9me2 and H3K9me3), and euchromatin is associated with histone acetylation on H3K9 and other histone residues. The H3K9me3 mark recruits heterochromatin protein 1 proteins, which propagate the H3K9me3 mark and recruit additional proteins involved in heterochromatin organization (Bannister et al., 2001; Lachner et al., 2001). In a different mechanism, lysine acetylation reduces the histone’s positive charge and weakens electrostatic interactions with the surrounding DNA, thereby resulting in looser compaction (Shogren-Knaak et al., 2006). Histone modifications encode even finer divisions, with particular marks in the vicinity of promoters, enhancers, transcription start sites, and other regulatory regions. Opposing enzymes maintain the histone code by writing and erasing histone modifications, tightly controlling the chromatin state. Now, microarray-based and next-generation sequencing techniques have provided researchers with a glimpse into reading and interpreting the histone code in a variety of cell types.

Several large efforts have recently attempted to map regulatory regions in the human genome. The Encyclopedia of DNA Elements (ENCODE) project performed a number of ChIP-seq, RNA-seq, DNase I hypersensitivity, and other experiments to identify sites of histone modifications, transcription factor binding, mRNA expression, open chromatin, DNA methylation, and other regulatory features in a range of cell lines (The ENCODE Project Consortium, 2012). While useful, the focus on a small range of cell lines, rather than primary tissues, limits applicability to many human diseases. A complementary effort by the Roadmap Epigenomics Mapping Consortium produced data on stem cells and primary ex vivo tissue samples (Bernstein et al., 2010), providing a relevant dataset for many studies. The PsychENCODE project extends the study of regulatory genomic elements to specific brain regions and cell types involved in psychiatric disease.

In an attempt to further translate the histone code, Ernst et al. (2011) trained a Hidden Markov Model (ChromHMM) to predict genomewide regulatory states in several cell types analyzed by the ENCODE project based on the profile of adjacent histone modifications. This approach also allowed inference of the target genes of putative regulatory regions, and has since been applied to the Roadmap Epigenomics Mapping Consortium data (Ernst and Kellis, 2012). Like coding regions, which are enriched for variants (e.g., missense, nonsense, and splice-site variants) that cause human disease, regulatory regions have now been shown to harbor disease-causing variants, including GWAS hits (Gusev et al., 2014). The inverse is also true: understanding the regulatory context around genetic variants with a statistical association to disease can help to deduce the molecular mechanisms involved (Boyle et al., 2012; Schaub et al., 2012). Some authors have also proposed variant prioritization strategies that integrate regulatory annotations (Pickrell, 2014), leveraging our improved functional understanding of the vast majority of the genome that does not encode proteins.
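The idea of assigning a hidden chromatin state to each genomic bin from the marks observed along the chromosome can be illustrated with a toy hidden Markov model. The sketch below is not ChromHMM: that tool learns its parameters from data and models combinations of many marks, whereas this example assumes two hand-picked states, a single binary mark per bin, and invented probabilities. All names and numbers are hypothetical. It decodes the most probable state path with the Viterbi algorithm.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Most probable hidden-state path for a non-empty observation list.

    All probabilities must be strictly positive, because the recursion
    is carried out in log space to avoid numerical underflow.
    """
    scores = {
        s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
        for s in states
    }
    backpointers = []
    for obs in observations[1:]:
        new_scores, pointers = {}, {}
        for s in states:
            # Best predecessor state for arriving in state s at this bin.
            best_prev = max(
                states, key=lambda p: scores[p] + math.log(trans_p[p][s])
            )
            new_scores[s] = (
                scores[best_prev]
                + math.log(trans_p[best_prev][s])
                + math.log(emit_p[s][obs])
            )
            pointers[s] = best_prev
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(states, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Hypothetical parameters: 1 = mark present in a bin, 0 = mark absent.
STATES = ("active", "quiescent")
START = {"active": 0.5, "quiescent": 0.5}
TRANS = {
    "active": {"active": 0.9, "quiescent": 0.1},
    "quiescent": {"active": 0.1, "quiescent": 0.9},
}
EMIT = {
    "active": {1: 0.8, 0: 0.2},
    "quiescent": {1: 0.1, 0: 0.9},
}

if __name__ == "__main__":
    print(viterbi([0, 0, 1, 1, 1, 0, 0], STATES, START, TRANS, EMIT))
```

With these toy parameters, a run of three marked bins is decoded as an active segment, while a single isolated marked bin is absorbed into the surrounding quiescent state. This shows how the transition probabilities smooth noisy signal across adjacent bins.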

RNA AND PROTEINS

Following studies of auxotrophic strains of Neurospora, Beadle and Tatum (1941) proposed the highly influential one gene, one enzyme hypothesis, in which a single gene encodes an enzyme that catalyzes a single step in a metabolic pathway. Although now perceived as overly simplistic, this model provides a useful framework in which to understand how genetic and epigenetic variation can lead to human disease via changes in RNA and protein expression. Some genetic variants lead to changes in the amino acid sequence of a protein, while others may lead to changes in the amount of RNA or protein produced. Therefore, quantifying RNA and protein and understanding the role of proteins in biologic processes can yield insight into gene function and disease mechanisms.


Publicly available gene expression data

Gene expression data from published and unpublished studies are often stored in the NCBI Gene Expression Omnibus (GEO) (Barrett et al., 2013) or the European Bioinformatics Institute (EBI) ArrayExpress (Kolesnikov et al., 2015) databases. GEO and ArrayExpress are designed to contain raw data, final processed data, and annotation regarding experimental design, microarray platform, and data processing, as described in the Minimum Information About a Microarray Experiment (MIAME) guidelines (Brazma et al., 2001). Raw data deposited at GEO have been used to identify errors in data interpretation; in one instance, a claim of widespread RNA editing was later attributed to platform artifacts (Pickrell et al., 2012). Publicly available expression data have also been used to support and conduct numerous independent studies. For example, we identified methylation quantitative trait loci in a study of dementia patients, and were able to validate these findings using high-throughput data from an independent study (Li et al., 2014). Complementary to the array databases, the Sequence Read Archive (SRA) was developed for storage of massively parallel sequencing data (Kodama et al., 2012), and is increasingly useful as RNA- and DNA-sequencing techniques become more widely used.

Spatiotemporal patterns of gene expression

Where and when a gene is transcribed and translated reveals a great deal about its function; a gene expressed specifically in a tissue or at a particular stage of development is likely involved in biologic processes of importance in those scenarios. Furthermore, the regulation of gene expression is also typically tissue-specific, and therefore must be studied in the appropriate context. Several databases illustrate the tissue-specific expression of mRNA. One example, the European Molecular Biology Laboratory (EMBL)-EBI Expression Atlas, contains curated data from published microarray and RNA-seq experiments that measure gene expression in human tissues, cell lines, and animal models (Petryszak et al., 2014). The user interface enables easy navigation and displays a graphic representation of expression across different conditions. In mouse and human brain, the Allen Brain Atlas provides: (1) in situ hybridization images of selected genes in mouse brain at adulthood and various stages of development (Mouse Brain Atlas and Developing Mouse Brain Atlas); (2) microarray data in human brain across cortical and subcortical brain regions (Human Brain Atlas); (3) RNA sequencing, exon microarray, and in situ hybridization images of selected genes in developing human brain (as part of the BrainSpan Atlas); and (4) mRNA expression data in nonhuman primates, mouse spinal cord, and glioblastoma (Hawrylycz et al., 2014). Gene expression data in the developing human brain have been used to infer the specific stages of development and brain regions that are important in the pathogenesis of autism (Parikshak et al., 2013). Another rich dataset, the Genotype-Tissue Expression (GTEx) project, combines tissue-specific gene expression data (from RNA-seq) from autopsy specimens with paired genotype data from the same individuals (Lonsdale et al., 2013). This combination of data types additionally enables the study of genetic influences on gene expression in a wide array of tissue types, including, as just one example, the prediction of downstream gene expression effects of disease-associated regulatory genetic variation (Gamazon et al., 2015).

More recently, several groups have directly quantified proteins across human tissues. Wilhelm et al. (2014) and Kim et al. (2014) describe mass spectrometry approaches to map the human proteome in a variety of tissues and to identify novel protein-coding genes (for example, among transcripts previously annotated as noncoding RNAs). Uhlen et al. (2015) describe a different approach using immunohistochemistry of tissue microarrays in combination with RNA-seq. Trained pathologists annotated more than 13 million images generated with 24,028 antibodies. In parallel, the authors performed RNA sequencing to determine the abundance of transcripts in various tissues and quantify tissue-specific gene expression; the results are available on the Human Protein Atlas website. While the tissue microarray method allows better assessment of the cellular distribution of protein expression compared to mass spectrometry methods, the measurement is much more qualitative and cannot identify novel peptides. Furthermore, the reliability of many of the antibodies is uncertain.
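The pairing of genotype and expression data described above for GTEx lends itself to a simple illustration. The sketch below estimates the effect of a variant on expression as the least-squares slope of expression on allele dosage, which is one common way of framing an expression quantitative trait locus. The function name and all values are hypothetical, and a real analysis would also adjust for covariates and correct for multiple testing.

```python
def eqtl_effect_size(dosages, expression):
    """Least-squares slope of expression on allele dosage (0, 1, or 2 copies)."""
    n = len(dosages)
    mean_dosage = sum(dosages) / n
    mean_expr = sum(expression) / n
    covariance = sum(
        (g - mean_dosage) * (e - mean_expr) for g, e in zip(dosages, expression)
    )
    variance = sum((g - mean_dosage) ** 2 for g in dosages)
    if variance == 0:
        raise ValueError("all individuals share one genotype; slope is undefined")
    return covariance / variance


# Made-up values for six hypothetical individuals in one tissue.
toy_dosages = [0, 0, 1, 1, 2, 2]
toy_expression = [5.0, 5.2, 6.1, 5.9, 7.0, 7.2]
print(round(eqtl_effect_size(toy_dosages, toy_expression), 3))
```

A slope near zero would suggest no detectable effect of the variant on expression in that tissue, whereas repeating the calculation per tissue is what allows tissue-specific effects to be compared.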

Gene function

Disease-associated and tissue-specific expression of a gene determined from high-throughput data provides clues about gene function, but careful investigation and characterization have yielded more definitive information. Databases such as the Online Mendelian Inheritance in Man (OMIM) catalog information known about a gene through the literature. OMIM provides curated reviews of important publications regarding a gene’s relation to human disease, as well as gene function, gene structure, population genetics, and other topics (Amberger et al., 2015). The Universal Protein Resource (UniProt) also reviews the known functions of a protein, as well as providing links to many external resources (including OMIM) (The UniProt Consortium, 2015). It further reviews the literature on posttranslational modifications, functional domains, and isoforms of the protein, summarizing these data in concise tables.

Though these tools are very useful, comparing gene functions and performing computational analysis on a large set of genes can be challenging owing to a lack of standardization in language. Gene Ontology (GO) provides annotations of a gene’s role in normal cells using a controlled vocabulary in three basic domains – cellular component, biologic process, and molecular function – and can be searched using the AmiGO browser (The Gene Ontology Consortium, 2015). For example, among GO terms associated with the MTOR gene (encoding the mechanistic target of rapamycin, mTOR) are the biologic processes “cellular response to nutrient levels,” “negative regulation of cell size,” and “positive regulation of translation”; the cellular components “cytosol,” “neuronal cell body,” and “dendrite”; and the molecular functions “drug binding” and “protein serine/threonine kinase activity.” These terms paint a picture of the known role of the mTOR complex 1 in integrating signals from cell nutrient levels and stressors and controlling protein synthesis via phosphorylation; other GO terms are also associated with mTOR complex 2 functions. The use of standardized terms linked to genes also enables the calculation of whether a given gene list is enriched in a particular GO term compared with a background set. GO term enrichment analysis can be directly performed on the GO website, as well as other web-based services such as the Database for Annotation, Visualization, and Integrated Discovery (DAVID) (Huang et al., 2008) and downloadable software such as GO-Elite (Zambon et al., 2012). A report of diagnostic exome sequencing in patients with intellectual disability used GO enrichment of known disease genes to identify biologic processes in which these genes participated; these GO terms were then used to determine whether identified variants were likely to be pathogenic. The authors report a diagnostic yield of 16% in this cohort (de Ligt et al., 2012).
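The enrichment calculation mentioned above is often framed as a one-sided hypergeometric test: given a background of genes, some of which carry a GO term, how likely is it that a gene list of a given size contains at least the observed number of annotated genes by chance? The sketch below is a minimal version under that assumption; it does not reproduce the exact statistics of DAVID or GO-Elite, which apply their own adjustments, and the gene counts in the usage lines are invented.

```python
from math import comb


def hypergeom_enrichment_p(background_size, annotated_in_background,
                           list_size, annotated_in_list):
    """P(X >= annotated_in_list) when list_size genes are drawn without replacement."""
    upper = min(list_size, annotated_in_background)
    total = comb(background_size, list_size)
    # Sum the upper tail of the hypergeometric distribution.
    tail = sum(
        comb(annotated_in_background, k)
        * comb(background_size - annotated_in_background, list_size - k)
        for k in range(annotated_in_list, upper + 1)
    )
    return tail / total


# Hypothetical counts: 20,000 background genes, 200 carrying the term,
# and a 100-gene list in which 8 genes carry the term.
p_value = hypergeom_enrichment_p(20000, 200, 100, 8)
print(f"enrichment p-value: {p_value:.3g}")
```

Because many GO terms are usually tested at once, the resulting p-values would normally be corrected for multiple comparisons before interpretation.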

PATHWAYS AND GENE NETWORKS

Gene set enrichment analyses begin to illustrate the idea that the function of genes can be better understood in relation to other genes, rather than in isolation, since genes rarely act alone. Genes that participate in particular pathways or biologic processes are often co-regulated or encode proteins that form complexes with other proteins. Therefore, the function of a poorly characterized gene can be predicted from functional enrichment of a list of well-characterized, interacting/coexpressed genes in a guilt-by-association approach (Aravind, 2000; Dougherty et al., 2005). Furthermore, proteins with redundant functions may increase the robustness of a biologic process, and alterations in key signaling proteins may have far-reaching downstream effects. These emergent phenomena can be modeled and captured in higher-order models of gene interactions, or gene networks. The mathematics of graph theory has been applied to gene networks to understand the interactions between genes, to identify important gene sets involved in disease, and to pinpoint specific genes that may be particularly important for cellular function (Jeong et al., 2000; Barabasi et al., 2011).
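Two of the ideas above, highly connected genes and guilt by association, can be sketched with a small undirected graph. In the example below, genes are ranked by degree (number of neighbors), and an unannotated gene is assigned the annotation most common among its neighbors. The gene names, edges, and annotations are placeholders, and majority voting is only one simple way of implementing guilt by association.

```python
from collections import Counter, defaultdict


def build_adjacency(edges):
    """Build an undirected adjacency map from (gene, gene) pairs."""
    adjacency = defaultdict(set)
    for a, b in edges:
        if a != b:  # ignore self-interactions
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency


def rank_by_degree(adjacency):
    """Genes sorted from most to fewest neighbors (ties broken alphabetically)."""
    return sorted(adjacency, key=lambda gene: (-len(adjacency[gene]), gene))


def predict_annotation(gene, adjacency, annotations):
    """Guilt by association: most common annotation among annotated neighbors."""
    votes = Counter(
        annotations[n] for n in adjacency[gene] if n in annotations
    )
    return votes.most_common(1)[0][0] if votes else None


# Placeholder network and annotations, for illustration only.
toy_edges = [
    ("GENE_A", "GENE_B"), ("GENE_A", "GENE_C"), ("GENE_A", "GENE_X"),
    ("GENE_B", "GENE_X"), ("GENE_D", "GENE_X"),
]
toy_annotations = {
    "GENE_A": "translation", "GENE_B": "translation", "GENE_D": "cell cycle",
}
network = build_adjacency(toy_edges)
print(rank_by_degree(network)[:2])
print(predict_annotation("GENE_X", network, toy_annotations))
```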

Pathways

Pathway databases move beyond gene ontologies to put interactions among proteins and other molecules into biologic context. For example, in glycolysis a series of enzymes sequentially converts glucose into pyruvate through biochemical reactions rather than direct protein interaction. The Kyoto Encyclopedia of Genes and Genomes (KEGG) (Kanehisa and Goto, 2000) and Reactome (Croft et al., 2014) databases provide highly curated pathway information about fundamental processes such as metabolism, cell cycle, membrane trafficking, and many others, with links to primary literature sources. Enrichment of a gene list in pathways can be determined much as for GO terms, and tools such as DAVID (with KEGG pathways) and EnrichR (Kuleshov et al., 2016) can perform such analyses.

Biomedical literature mining

Curated knowledge based on scientific literature forms the basis of tools such as GO and pathway databases, but links between genes can often be found by computationally mining the text of published abstracts and manuscripts. At a very basic level, the PubMatrix tool extends simple PubMed queries by searching for pairwise combinations of keywords from two user-defined lists (Becker et al., 2003). For example, inputting two gene lists will return, for each pairwise combination of genes from the two lists, the number of publications referencing both genes. Its use can be extended beyond gene lists to lists of publication dates, biologic processes, diseases, and any other possible PubMed queries. For gene interactions, however, the co-occurrence of two genes in a manuscript does not imply any particular relationship between the genes, e.g., direct interaction with each other or participation in a common pathway. The Information Hyperlinked over Proteins (iHOP) tool addresses this problem by allowing users to manually curate PubMed-indexed publications. A search on iHOP extracts a text summary of a particular interaction mined from a publication, which the user can add to a gene network model, and subsequently save or export (Hoffmann and Valencia, 2004). A similar tool, Chilibot, automates this annotation process using natural language processing methods to assign a relationship (e.g., noninteracting, interacting, activating, or inhibiting) from analysis of the PubMed search results (Chen and Sharp, 2004). Compared to databases of curated information from literature, text-mining techniques can often address more specific questions about particular sets of genes. However, because of the difficulty in parsing or understanding written text, they may be more prone to errors. Furthermore, they are highly biased toward well-known and well-studied interactions between genes, and may not accurately reflect a gene’s true role in a particular context.
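The pairwise co-occurrence counting described for PubMatrix can be sketched locally, without querying PubMed, as follows. The function names, gene symbols, and example sentences are all placeholders; the actual tool issues PubMed searches rather than scanning text held in memory.

```python
import re
from itertools import product


def mentions(term, text):
    """True if the term appears as a whole word in the text (case-insensitive)."""
    pattern = r"\b" + re.escape(term) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


def cooccurrence_matrix(terms_a, terms_b, documents):
    """Count documents that mention both terms, for every pair across the two lists."""
    return {
        (a, b): sum(1 for doc in documents if mentions(a, doc) and mentions(b, doc))
        for a, b in product(terms_a, terms_b)
    }


# Placeholder "abstracts"; the second shows co-occurrence without any relationship.
toy_documents = [
    "GENE1 interacts with GENE3 in neurons.",
    "GENE1 and GENE4 were not associated.",
    "GENE2 is upstream of GENE3; GENE1 is unaffected.",
]
counts = cooccurrence_matrix(["GENE1", "GENE2"], ["GENE3", "GENE4"], toy_documents)
for pair, n in sorted(counts.items()):
    print(pair, n)
```

The second placeholder sentence counts toward the GENE1 and GENE4 pair even though it reports no association, which is the limitation of simple co-occurrence noted above.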

Protein–protein interactions (PPI)

Direct interactions between proteins can be determined experimentally. Interactions can be tested one at a time (low-throughput methods, such as co-immunoprecipitation), or across a large number of potential interacting proteins (high-throughput methods, such as yeast two-hybrid screening and affinity purification coupled to mass spectrometry). While low-throughput methods are limited in scope and relatively expensive, high-throughput methods suffer from a high false-positive rate and poor agreement between experiments (Mrowka et al., 2001; Edwards et al., 2002; Huang et al., 2007). In fact, one database even contains background PPIs (false positives) detected by affinity purification coupled to mass spectrometry and has been colorfully named the “CRAPome” (Mellacheruvu et al., 2013). Initial large-scale PPI studies assayed approximately 4000–8000 genes using high-throughput methods such as yeast two-hybrid systems and validated the results with orthogonal assays (Rual et al., 2005; Stelzl et al., 2005). A recent update to these systematic screens surveyed interactions between a set of 13,000 open reading frames, representing most of the known protein-coding genes in humans (Rolland et al., 2014).

Although these studies have provided insight into the human interactome, much of the PPI data comes from a collection of publications with much smaller scope, which are catalogued in several databases. The IntAct database is one of the largest curated PPI databases, containing (as of October 2017) 785,947 interactions in human, yeast, Drosophila melanogaster, Escherichia coli, Caenorhabditis elegans, and other species from 19,965 publications, including a mix of low-throughput and high-throughput methodologies (Kerrien et al., 2011). Other examples of curated PPI databases include DIP (Salwinski et al., 2004), Biological General Repository for Interaction Datasets (BioGRID) (Chatr-aryamontri et al., 2015), MINT (Licata et al., 2012), InWeb (Li et al., 2017), and others. The Search Tool for the Retrieval of Interacting Genes/Proteins (STRING) combines data from several primary PPI databases and other outside sources with interactions derived from computational predictions and text mining (Szklarczyk et al., 2015). While STRING contains a large number of interactions, many are low-quality associations for which the evidence of interaction is rather weak, and they must therefore be interpreted with caution.

Most of these databases differentiate between interactions found in low- versus high-throughput data; interactions supported by low-throughput experiments are typically more accurate. However, reliance on low-throughput data also introduces biases into PPI databases. Proteins that have been studied extensively tend to have more known interactions than proteins on which few experiments have been performed. PPI databases also do not currently distinguish data from different cell types, and therefore may not be generalizable to specific tissues (e.g., brain), though several groups have developed computational approaches in attempts to circumvent this limitation (Guan et al., 2012; Magger et al., 2012). Parikshak et al. (2015) provide an excellent review of the pitfalls and strengths of using networks to understand neurodevelopment and neurodegeneration. Overall, PPIs from databases should be viewed as a potentially useful, but error-prone, biased, and incomplete picture of the true biologic interactome.

PPI networks have guided the identification of disease genes. As an example, Novarino et al. (2014) constructed a gene network associated with hereditary spastic paraplegia by mapping disease genes to interaction data from STRING and other databases. This disease-associated subnetwork was then used to prioritize additional genes harboring potentially pathogenic genetic variants that may play a role in hereditary spastic paraplegia pathogenesis. Furthermore, PPI networks can validate functional networks identified in independent experiments. One such approach, the Disease Association Protein–Protein Link Evaluator (DAPPLE) algorithm, uses PPIs from InWeb (based on data collected from IntAct and other databases) to identify whether a gene list is significantly enriched in interacting proteins (Rossin et al., 2011). The DAPPLE approach has been used to test interconnected gene networks in diseases such as schizophrenia (Xu et al., 2012), Alzheimer disease (Raj et al., 2012), and autism (Parikshak et al., 2013).
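The question of whether a gene list is enriched in interacting proteins can be illustrated with a simple permutation test. The sketch below counts interactions among the listed genes and compares that count with randomly drawn gene sets of the same size. It is a simplified stand-in rather than the DAPPLE algorithm: in particular, it samples genes uniformly and makes no attempt to match the number of interaction partners per gene, which matters because well-studied proteins have more known interactions. The network and gene names are invented.

```python
import random
from itertools import combinations


def count_internal_edges(genes, edges):
    """Number of interactions whose two partners are both in the gene set."""
    gene_set = set(genes)
    return sum(1 for a, b in edges if a in gene_set and b in gene_set)


def connectivity_p_value(gene_list, all_genes, edges, n_permutations=1000, seed=0):
    """Empirical p-value for observing at least this many internal interactions."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = count_internal_edges(gene_list, edges)
    at_least = 0
    for _ in range(n_permutations):
        sample = rng.sample(all_genes, len(gene_list))
        if count_internal_edges(sample, edges) >= observed:
            at_least += 1
    # Add-one correction keeps the estimate above zero.
    return observed, (at_least + 1) / (n_permutations + 1)


# Invented network: four fully interconnected genes plus a sparse chain.
toy_genes = [f"G{i}" for i in range(30)]
toy_list = ["G0", "G1", "G2", "G3"]
toy_edges = list(combinations(toy_list, 2)) + [
    (f"G{i}", f"G{i + 1}") for i in range(4, 29)
]
observed, p_value = connectivity_p_value(toy_list, toy_genes, toy_edges)
print(observed, p_value)
```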


Coexpression

Similarly to proteins, gene products at the mRNA level (comprising the transcriptome) interact in measurable ways that can be inferred from gene expression data. One such approach, gene coexpression, relies on correlation of gene expression between two genes across multiple samples. Coexpression may indicate coordinated gene regulation or participation in common biologic pathways (Stuart et al., 2003). Differences in cell-type composition may also lead to the appearance of coexpression; for instance, changes in the abundance of neurons, oligodendrocytes, astrocytes, and microglia between brain regions result in a cell type-specific coexpression network (Oldham et al., 2008). Therefore, the resolution of datasets (e.g., at the cell, tissue, or organismal level) informs the interpretation of gene coexpression. Databases such as COXPRESdb calculate and store coexpression relationships between genes. COXPRESdb determines gene coexpression from public microarray and RNA-seq data in multiple species, and further integrates PPI data (Okamura et al., 2015). Another database, Search-based Exploration of Expression Compendium (SEEK), assesses gene coexpression (optionally subdivided by broad tissue categories) in human datasets (Zhu et al., 2015). Software packages, such as Weighted Gene Coexpression Network Analysis (WGCNA), can also calculate coexpression on custom data (Langfelder and Horvath, 2008). This approach has been used to identify modules of coexpressed genes related to glioblastoma (Horvath et al., 2006), Alzheimer disease (Miller et al., 2008), autism (Voineagu et al., 2011), and many others (Parikshak et al., 2015).
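As a minimal illustration of coexpression, the sketch below computes the Pearson correlation between pairs of genes across samples and then raises its absolute value to a power, in the manner of the unsigned weighted adjacency used by WGCNA. This is written from scratch in Python rather than with the WGCNA R package; the exponent of 6 is assumed here as a typical starting value, and the expression values are invented.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length expression vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def soft_threshold_adjacency(expression, power=6):
    """Unsigned weighted adjacency |cor|^power for every unordered gene pair."""
    genes = sorted(expression)
    return {
        (g, h): abs(pearson(expression[g], expression[h])) ** power
        for i, g in enumerate(genes)
        for h in genes[i + 1:]
    }


# Invented expression values for three placeholder genes across five samples.
toy_expression = {
    "GENE_A": [1, 2, 3, 4, 5],
    "GENE_B": [2, 4, 6, 8, 10],
    "GENE_C": [5, 3, 4, 1, 2],
}
for pair, weight in soft_threshold_adjacency(toy_expression).items():
    print(pair, round(weight, 3))
```

Raising the correlation to a power suppresses weak correlations relative to strong ones, so that only tightly correlated gene pairs retain substantial weight in the network.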

DATABASE COLLECTIONS

With a wealth of resources, many of which provide orthogonal or complementary information, collections of databases have become very useful in their own right. In its most basic form, the database collection can help users to locate relevant databases. The NCBI hosts a number of bioinformatics databases, such as ClinVar, dbSNP, GEO, and RefSeq, which are organized and listed on its site (NCBI Resource Coordinators, 2015). The Ensembl website contains a similar set of resources, including the Ensembl Genome Browser, BioMart, and the Variant Effect Predictor (Cunningham et al., 2015). More extensive in scope, the journal Nucleic Acids Research maintains the NAR Molecular Biology Database Collection, which accompanies its annual database issue (Galperin et al., 2015). As of January 2017, this web resource provides a listing of nearly 1,700


currently maintained databases, which can be organized alphabetically or by category. On a gene-by-gene level, the GeneCards website provides automatically updated information and links to outside databases such as OMIM, UniProt, PubMed, and many others (Rebhan et al., 1998). Another excellent resource, BioGPS, displays nomenclature and tissue-specific expression patterns from public data for a query gene, and can be customized to display data from literature searches, pathway databases, expression data, protein interaction databases, and other annotations (Wu et al., 2009). More advanced database collections provide annotation, data analysis, and visualization features. Ensembl’s BioMart project maintains databases of gene nomenclature and GO annotations, enabling easy conversion between gene names and accession numbers from other databases (Kinsella et al., 2012). Genetic variants can also be annotated with their predicted molecular effect, presence in large cohorts, and many other descriptions from a number of databases using programs such as ANNOVAR (Wang et al., 2010) and the Ensembl Variant Effect Predictor (McLaren et al., 2010). The EnrichR web portal provides user-friendly exploratory analysis of gene sets to test for enrichment in datasets such as KEGG, disease perturbations in GEO, and many others (Kuleshov et al., 2016). These tools are well suited to rapid annotation of large datasets. For visualization of multiple databases on the genome level, the University of California, Santa Cruz genome browser (Kent et al., 2002) is one of the most useful and widely used bioinformatics tools, and incorporates data from numerous databases to display a graphic representation of a given genomic region. Users navigate the chromosomes of the human genome (or genomes of other species) as a linear sequence, which can be magnified around a user-defined region. 
Genomic data, such as gene definitions, genetic variants, published annotations, and literature data, can then be overlaid on the sequence as tracks (Fig. 7.2). Custom tracks, e.g., from RNA-seq or ChIP-seq peaks, or linkage analysis results, can also be displayed. Other, more specialized genome browsers have additional features. The WashU Epigenome Browser hosts data from the Roadmap Epigenomics Project, as well as other genomics datasets (Zhou et al., 2013). Uniquely, this genome browser can display long-range interaction data, such as ChIA-PET (Zhou et al., 2013), and perform simple data analysis procedures such as producing scatterplots and summarizing data over an individual gene or a gene set (such as a KEGG pathway). Such tools allow for easy integration of many types of genomic data in a given analysis.
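Custom tracks are commonly supplied to the genome browser as plain text in BED format, in which each row gives a chromosome, a zero-based start, an end, and an optional name. The sketch below assembles such text for a few intervals. The helper name, track name, and peak coordinates are hypothetical, and only the simplest four-column form with a `browser position` line and a `track` line is shown.

```python
def bed_custom_track(name, description, position, intervals):
    """Return custom-track text: a browser line, a track line, then BED4 rows."""
    lines = [
        f"browser position {position}",
        f'track name="{name}" description="{description}"',
    ]
    # Sort by chromosome, then start, so rows appear in genomic order.
    for chrom, start, end, label in sorted(intervals):
        if not 0 <= start < end:
            raise ValueError(f"invalid interval for {label}: {start}-{end}")
        lines.append(f"{chrom}\t{start}\t{end}\t{label}")
    return "\n".join(lines) + "\n"


# Hypothetical peaks near MAPT; coordinates are placeholders on hg19.
toy_peaks = [
    ("chr17", 44050000, 44050500, "peak_1"),
    ("chr17", 44010000, 44010400, "peak_2"),
]
track_text = bed_custom_track(
    "toy_peaks", "Illustrative ChIP-seq peaks", "chr17:44000000-44100000", toy_peaks
)
print(track_text, end="")
```

The resulting text could be pasted into the browser's custom track form or saved to a file for upload.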


[Figure 7.2: genome browser screenshot (hg19) of chr17 (q21.31), approximately chr17:44,000,000–44,100,000, showing MAPT-AS1, STH, MAPT, MAPT-IT1, and KANSL1. Tracks: RefSeq Genes; ClinVar Variants, CNVs Excluded; Layered H3K27Ac (7 cell lines from ENCODE); DNase Clusters (125 cell types from ENCODE); Txn Factor ChIP (161 factors from ENCODE); 100 Vert. Cons (basewise conservation by PhyloP); Multiz Alignments of 100 Vertebrates; Common SNPs (dbSNP 141); RepeatMasker.]

Fig. 7.2. Usage example for the University of California, Santa Cruz genome browser. The browser image depicts histone markers, evolutionary conservation, common variants, and repeat regions up to a single-base resolution around the MAPT gene, encoding the microtubule-associated protein tau.

CONCLUSION AND FUTURE DIRECTIONS

We are currently in the midst of an exciting juncture in medicine. Genes and genetic variants that contribute to neurologic diseases are being identified at an unprecedented pace. At the same time, public databases house a plethora of genomic datasets that can place novel genetic findings into context in a comprehensive and user-friendly manner. Sir Isaac Newton proclaimed that he could see further because he was “standing on the shoulders of giants,” a sentiment which rings loudly in this new era of genomic science. Understanding and exploiting available bioinformatics databases has allowed clinicians to predict the likely molecular effect of variants of uncertain significance, and enabled researchers to perform integrative analyses leveraging genetic, epigenetic, transcriptomic, and proteomic datasets to point toward important biologic mechanisms of disease. On the horizon, genomics will transcend its current uses in rare monogenic forms of disease by incorporating common risk polymorphisms using techniques such as polygenic risk scores (Dudbridge, 2013) and integration with other forms of genomic data, e.g., mRNA expression (Gonorazky et al., 2016). Drug discovery may be more focused on molecular mechanisms ascertained from genetics and transcriptomics, with companion diagnostics to maximize their efficacy (Verbist et al., 2015). Large-scale studies that combine multiple levels of genomic and phenotypic data in patient populations of interest will enable these developments and continue the inexorable transformation of medicine.
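In its simplest form, a polygenic risk score is a weighted sum of risk-allele counts, with each weight taken from the effect size estimated for that variant in an association study. The sketch below assumes exactly that form; the variant identifiers, effect sizes, and dosages are invented, and practical implementations additionally handle variant selection, linkage disequilibrium, and score standardization.

```python
def polygenic_risk_score(effect_sizes, dosages):
    """Sum of effect size times risk-allele dosage over all scored variants."""
    missing = set(effect_sizes) - set(dosages)
    if missing:
        raise KeyError(f"no genotype for variants: {sorted(missing)}")
    return sum(beta * dosages[variant] for variant, beta in effect_sizes.items())


# Hypothetical per-variant effect sizes and one individual's allele dosages.
toy_effect_sizes = {"variant_1": 0.2, "variant_2": -0.1, "variant_3": 0.05}
toy_dosages = {"variant_1": 2, "variant_2": 1, "variant_3": 0}
print(round(polygenic_risk_score(toy_effect_sizes, toy_dosages), 3))
```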


REFERENCES Adzhubei IA, Schmidt S, Peshkin L et al. (2010). A method and server for predicting damaging missense mutations. Nat Methods 7: 248–249. Amberger JS, Bocchini CA, Schiettecatte F et al. (2015). OMIM.org: Online Mendelian Inheritance in Man (OMIM®), an online catalog of human genes and genetic disorders. Nucleic Acids Res 43: D789–D798. Andreasen C, Nielsen JB, Refsgaard L et al. (2013). New population-based exome data are questioning the pathogenicity of previously cardiomyopathy-associated genetic variants. Eur J Hum Genet 21: 918–928. Aravind L (2000). Guilt by association: contextual information in genome analysis. Genome Res 10: 1074–1077. Bannister AJ, Kouzarides T (2011). Regulation of chromatin by histone modifications. Cell Res 21: 381–395. Bannister AJ, Zegerman P, Partridge JF et al. (2001). Selective recognition of methylated lysine 9 on histone H3 by the HP1 chromo domain. Nature 410: 120–124. Barabasi A-L, Gulbahce N, Loscalzo J (2011). Network medicine: a network-based approach to human disease. Nat Rev Genet 12: 56–68. Barrett T, Wilhite SE, Ledoux P et al. (2013). NCBI GEO: archive for functional genomics data sets – update. Nucleic Acids Res 41: D991–D995. Basu SN, Kollu R, Banerjee-Basu S (2009). AutDB: a gene reference resource for autism research. Nucleic Acids Res 37: D832–D836. Beadle GW, Tatum EL (1941). Genetic control of biochemical reactions in neurospora. Proc Natl Acad Sci U S A 27: 499–506. Becker K, Hosack D, Dennis G et al. (2003). PubMatrix: a tool for multiplex literature mining. BMC Bioinformatics 4: 61. Bernstein BE, Stamatoyannopoulos JA, Costello JF et al. (2010). The NIH roadmap epigenomics mapping consortium. Nat Biotech 28: 1045–1048. Boyle AP, Hong EL, Hariharan M et al. (2012). Annotation of functional variation in personal genomes using RegulomeDB. Genome Res 22: 1790–1797. Brazma A, Hingamp P, Quackenbush J et al. (2001). 
Minimum information about a microarray experiment (MIAME) – toward standards for microarray data. Nat Genet 29: 365–371. Cassa CA, Tong MY, Jordan DM (2013). Large numbers of genetic variants considered to be pathogenic are common in asymptomatic individuals. Hum Mutat 34: 1216–1220. Chatr-aryamontri A, Breitkreutz B-J, Oughtred R et al. (2015). The BioGRID interaction database: 2015 update. Nucleic Acids Res 43: D470–D478. Chen H, Sharp B (2004). Content-rich biological network constructed by mining PubMed abstracts. BMC Bioinformatics 5: 147. Croft D, Mundo AF, Haw R et al. (2014). The Reactome pathway knowledgebase. Nucleic Acids Res 42: D472–D477. Cruts M, Theuns J, Van Broeckhoven C (2012). Locus-specific mutation databases for neurodegenerative brain diseases. Hum Mutat 33: 1340–1344.


Cunningham F, Amode MR, Barrell D et al. (2015). Ensembl 2015. Nucleic Acids Res 43: D662–D669. de Ligt J, Willemsen MH, van Bon BWM et al. (2012). Diagnostic exome sequencing in persons with severe intellectual disability. N Engl J Med 367: 1921–1929. Dougherty JD, Garcia ADR, Nakano I et al. (2005). PBK/TOPK, a proliferating neural progenitor-specific mitogen-activated protein kinase kinase. J Neurosci 25: 10773–10785. Dudbridge F (2013). Power and predictive accuracy of polygenic risk scores. PLoS Genet 9: e1003348. Edwards AM, Kus B, Jansen R et al. (2002). Bridging structural biology and genomics: assessing protein interaction data with known complexes. Trends Genet 18: 529–536. Ernst J, Kellis M (2012). ChromHMM: automating chromatin-state discovery and characterization. Nat Methods 9: 215–216. Ernst J, Kheradpour P, Mikkelsen TS et al. (2011). Mapping and analysis of chromatin state dynamics in nine human cell types. Nature 473: 43–49. Fu W, O’Connor TD, Jun G et al. (2013). Analysis of 6,515 exomes reveals the recent origin of most human protein-coding variants. Nature 493: 216–220. Galperin MY, Rigden DJ, Fernández-Suárez XM (2015). The 2015 nucleic acids research database issue and molecular biology database collection. Nucleic Acids Res 43: D1–D5. Gamazon ER, Wheeler HE, Shah KP et al. (2015). A gene-based association method for mapping traits using reference transcriptome data. Nat Genet 47: 1091–1098. Gonorazky H, Liang M, Cummings B et al. (2016). RNAseq analysis for the diagnosis of muscular dystrophy. Ann Clin Transl Neurol 3: 55–60. Guan Y, Gorenshteyn D, Burmeister M et al. (2012). Tissue-specific functional networks for prioritizing phenotype and disease genes. PLoS Comput Biol 8: e1002694. Gusev A, Lee SH, Trynka G et al. (2014). Partitioning heritability of regulatory and cell-type-specific variants across 11 common diseases. Am J Hum Genet 95: 535–552. Hawrylycz M, Ng L, Feng D et al. (2014). The Allen brain atlas. In: N Kasabov (Ed.), Springer Handbook of Bio-/Neuroinformatics. Springer, Berlin, pp. 1111–1126. Hegi ME, Liu L, Herman JG et al. (2008). Correlation of O6-methylguanine methyltransferase (MGMT) promoter methylation with clinical outcomes in glioblastoma and clinical strategies to modulate MGMT activity. J Clin Oncol 26: 4189–4199. Hoffmann R, Valencia A (2004). A gene network for navigating the literature. Nat Genet 36: 664. Horvath S, Zhang B, Carlson M et al. (2006). Analysis of oncogenic signaling networks in glioblastoma identifies ASPM as a molecular target. Proc Natl Acad Sci U S A 103: 17402–17407. Howie B, Marchini J, Stephens M (2011). Genotype imputation with thousands of genomes. G3: Genes, Genomes, Genetics 1: 457–470. Huang H, Jedynak BM, Bader JS (2007). Where have all the interactions gone? Estimating the coverage of two-hybrid protein interaction maps. PLoS Comput Biol 3: e214.


Huang DW, Sherman BT, Lempicki RA (2008). Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protocols 4: 44–57. Huang J, Ellinghaus D, Franke A et al. (2012). 1000 Genomes-based imputation identifies novel and refined associations for the Wellcome Trust case control consortium phase 1 data. Eur J Hum Genet 20: 801–805. International Human Genome Sequencing Consortium et al. (2001). Initial sequencing and analysis of the human genome. Nature 409: 860–921. International Parkinson Disease Genomics Consortium (2011). Imputation of sequence variants for identification of genetic risks for Parkinson’s disease: a meta-analysis of genome-wide association studies. The Lancet 377: 641–649. Jeong H, Tombor B, Albert R et al. (2000). The large-scale organization of metabolic networks. Nature 407: 651–654. Kanehisa M, Goto S (2000). KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res 28: 27–30. Kel A, Voss N, Jauregui R et al. (2006). Beyond microarrays: finding key transcription factors controlling signal transduction pathways. BMC Bioinformatics 7: S13. Kent WJ, Sugnet CW, Furey TS et al. (2002). The human genome browser at UCSC. Genome Res 12: 996–1006. Kerrien S, Aranda B, Breuza L et al. (2011). The IntAct molecular interaction database in 2012. Nucleic Acids Res. Kim M-S, Pinto SM, Getnet D et al. (2014). A draft map of the human proteome. Nature 509: 575–581. Kinsella RJ, Kähäri A, Haider S et al. (2012). Ensembl BioMarts: a hub for data retrieval across taxonomic space. Database 40: D841–D846. Kodama Y, Shumway M, Leinonen R (2012). The sequence read archive: explosive growth of sequencing data. Nucleic Acids Res 40: D54–D56. Kolesnikov N, Hastings E, Keays M et al. (2015). ArrayExpress update – simplifying data submissions. Nucleic Acids Res 43: D1113–D1116. Kulakovskiy IV, Medvedeva YA, Schaefer U et al. (2013). HOCOMOCO: a comprehensive collection of human transcription factor binding sites models. 
Nucleic Acids Res 41: D195–D202. Kuleshov MV, Jones MR, Rouillard AD et al. (2016). Enrichr: a comprehensive gene set enrichment analysis web server 2016 update. Nucleic Acids Res 44: W90–W97. Lachmann A, Xu H, Krishnan J et al. (2010). ChEA: transcription factor regulation inferred from integrating genomewide ChIP-X experiments. Bioinformatics 26: 2438–2444. Lachner M, O’Carroll D, Rea S et al. (2001). Methylation of histone H3 lysine 9 creates a binding site for HP1 proteins. Nature 410: 116–120. Lai CSL, Fisher SE, Hurst JA et al. (2001). A forkhead-domain gene is mutated in a severe speech and language disorder. Nature 413: 519–523. Landrum MJ, Lee JM, Riley GR et al. (2014). ClinVar: public archive of relationships among sequence variation and human phenotype. Nucleic Acids Res 42: D980–D985. Langfelder P, Horvath S (2008). WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics 9: 559.

Lappalainen T, Sammeth M, Friedlander MR et al. (2013). Transcriptome and genome sequencing uncovers functional variation in humans. Nature 501: 506–511. Lek M, Karczewski KJ, Minikel EV et al. (2016). Exome Aggregation Consortium. Analysis of protein-coding genetic variation in 60,706 humans. Nature 536 (7616): 285–291. PMID: 27535533. Li Y, Chen JA, Sears RL et al. (2014). An epigenetic signature in peripheral blood associated with the haplotype on 17q21.31, a risk factor for neurodegenerative tauopathy. PLoS Genet 10: e1004211. Li T, Wernersson R, Hansen RB et al. (2017). A scored human protein–protein interaction network to catalyze genomic interpretation. Nat Meth 14: 61–64. Licata L, Briganti L, Peluso D et al. (2012). MINT, the molecular interaction database: 2012 update. Nucleic Acids Res 40: D857–D861. Lim ET, Raychaudhuri S, Sanders SJ et al. (2013). Rare complete knockouts in humans: population distribution and significant role in autism spectrum disorders. Neuron 77: 235–242. Liu X, Jian X, Boerwinkle E (2013). dbNSFP v2.0: A database of human non-synonymous SNVs and their functional predictions and annotations. Hum Mutat 34: E2393–E2402. Lonsdale J, Thomas J, Salvatore M et al. (2013). The Genotype-Tissue Expression (GTEx) project. Nat Genet 45: 580–585. Lowrance WW, Collins FS (2007). Identifiability in genomic research. Science 317: 600–602. Magger O, Waldman YY, Ruppin E et al. (2012). Enhancing the prioritization of disease-causing genes through tissue specific protein interaction networks. PLoS Comput Biol 8: e1002690. Mathelier A, Zhao X, Zhang AW et al. (2014). JASPAR 2014: an extensively expanded and updated open-access database of transcription factor binding profiles. Nucleic Acids Res 42: D142–D147. Matys V, Kel-Margoulis OV, Fricke E et al. (2006). TRANSFAC® and its module TRANSCompel®: transcriptional gene regulation in eukaryotes. Nucleic Acids Res 34: D108–D110. Maurano MT, Humbert R, Rynes E et al. (2012). 
Systematic localization of common disease-associated variation in regulatory DNA. Science 337: 1190–1195. McLaren W, Pritchard B, Rios D et al. (2010). Deriving the consequences of genomic variants with the Ensembl API and SNP effect predictor. Bioinformatics 26: 2069–2070. Mellacheruvu D, Wright Z, Couzens AL et al. (2013). The CRAPome: a contaminant repository for affinity purification-mass spectrometry data. Nat Methods 10: 730–736. Messina DN, Glasscock J, Gish W et al. (2004). An ORFeome-based analysis of human transcription factor genes and the construction of a microarray to interrogate their expression. Genome Res 14: 2041–2047.

BIOINFORMATICS AND GENOMIC DATABASES Miller JA, Oldham MC, Geschwind DH (2008). A systems level analysis of transcriptional changes in Alzheimer’s disease and normal aging. J Neurosci 28: 1410–1420. Mrowka R, Patzak A, Herzel H (2001). Is there a bias in proteome research? Genome Res 11: 1971–1973. NCBI Resource Coordinators (2015). Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 43: D6–D17. Ng SB, Bigham AW, Buckingham KJ et al. (2010). Exome sequencing identifies MLL2 mutations as a cause of Kabuki syndrome. Nat Genet 42: 790–793. Nicolae DL, Gamazon E, Zhang W et al. (2010). Traitassociated SNPs are more likely to be eQTLs: annotation to enhance discovery from GWAS. PLoS Genet 6: e1000888. Novarino G, Fenstermaker AG, Zaki MS et al. (2014). Exome sequencing links corticospinal motor neuron disease to common neurodegenerative disorders. Science 343: 506–511. Novembre J, Johnson T, Bryc K et al. (2008). Genes mirror geography within Europe. Nature 456: 98–101. Okamura Y, Aoki Y, Obayashi T et al. (2015). COXPRESdb in 2015: coexpression database for animal species by DNAmicroarray and RNAseq-based expression data with multiple quality assessment systems. Nucleic Acids Res 43: D82–D86. Oldham MC, Konopka G, Iwamoto K et al. (2008). Functional organization of the transcriptome in human brain. Nat Neurosci 11: 1271–1282. Parikshak NN, Luo R, Zhang A et al. (2013). Integrative functional genomic analyses implicate specific molecular pathways and circuits in autism. Cell 155: 1008–1021. Parikshak NN, Gandal MJ, Geschwind DH (2015). Systems biology and gene networks in neurodevelopmental and neurodegenerative disorders. Nat Rev Genet 16: 441–458. Petryszak R, Burdett T, Fiorelli B et al. (2014). Expression Atlas update – a database of gene and transcript expression from microarray- and sequencing-based functional genomics experiments. Nucleic Acids Res 42: D926–D932. Pickrell JK (2014). 
Joint analysis of functional genomic data and genome-wide association studies of 18 human traits. Am J Hum Genet 94: 559–573. Pickrell JK, Gilad Y, Pritchard JK (2012). Comment on “Widespread RNA and DNA sequence differences in the human transcriptome”. Science 335: 1302. Raj T, Shulman JM, Keenan BT et al. (2012). Alzheimer disease susceptibility loci: evidence for a protein network under natural selection. Am J Hum Genet 90: 720–726. Rauch A, Wieczorek D, Graf E et al. (2012). Range of genetic mutations associated with severe non-syndromic sporadic intellectual disability: an exome sequencing study. The Lancet 380: 1674–1682. Rebhan M, Chalifa-Caspi V, Prilusky J et al. (1998). GeneCards: a novel functional genomics compendium with automated data mining and query reformulation support. Bioinformatics 14: 656–664.

91

Rolland T, Tas¸an M, Charloteaux B et al. (2014). A proteomescale map of the human interactome network. Cell 159: 1212–1226. Rossin EJ, Lage K, Raychaudhuri S et al. (2011). Proteins encoded in genomic regions associated with immunemediated disease physically interact and suggest underlying biology. PLoS Genet 7: e1001273. Rual J-F, Venkatesan K, Hao T et al. (2005). Towards a proteome-scale map of the human protein-protein interaction network. Nature 437: 1173–1178. Salwinski L, Miller CS, Smith AJ et al. (2004). The database of interacting proteins: 2004 update. Nucleic Acids Res 32: D449–D451. Schaub MA, Boyle AP, Kundaje A et al. (2012). Linking disease associations with regulatory information in the human genome. Genome Res 22: 1748–1759. Shogren-Knaak M, Ishii H, Sun J-M et al. (2006). Histone H4-K16 acetylation controls chromatin structure and protein interactions. Science 311: 844–847. Stelzl U, Worm U, Lalowski M et al. (2005). A human protein– protein interaction network: a resource for annotating the proteome. Cell 122: 957–968. Stenson P, Mort M, Ball E et al. (2014). The Human Gene Mutation Database: building a comprehensive mutation repository for clinical and molecular genetics, diagnostic testing and personalized genomic medicine. Hum Genet 133: 1–9. Stuart JM, Segal E, Koller D et al. (2003). A genecoexpression network for global discovery of conserved genetic modules. Science 302: 249–255. Szklarczyk D, Franceschini A, Wyder S et al. (2015). STRING v10: protein–protein interaction networks, integrated over the tree of life. Nucleic Acids Res 43: D447–D452. Takahashi K, Yamanaka S (2006). Induction of pluripotent stem cells from mouse embryonic and adult fibroblast cultures by defined factors. Cell 126: 663–676. The 1000 Genomes Project Consortium (2012). An integrated map of genetic variation from 1,092 human genomes. Nature 491: 56–65. The ENCODE Project Consortium (2012). An integrated encyclopedia of DNA elements in the human genome. 
Nature 489: 57–74. The Gene Ontology Consortium (2015). Gene Ontology Consortium: going forward. Nucleic Acids Res 43: D1049–D1056. The International HapMap Consortium (2003). The International HapMap Project. Nature 426: 789–796. The UniProt Consortium (2015). UniProt: a hub for protein information. Nucleic Acids Res 43: D204–D212. Tryka KA, Hao L, Sturcke A et al. (2014). NCBI’s database of genotypes and phenotypes: dbGaP. Nucleic Acids Res 42: D975–D979. Uhlen M, Fagerberg L, Hallstr€ om BM et al. (2015). Tissuebased map of the human proteome. Science 347. Verbist B, Klambauer G, Vervoort L et al. (2015). Using transcriptomics to guide lead optimization in drug discovery projects: lessons learned from the QSTAR project. Drug Discov Today 20: 505–513.

92

J. CHEN AND G. COPPOLA

Voineagu I, Wang X, Johnston P et al. (2011). Transcriptomic analysis of autistic brain reveals convergent molecular pathology. Nature 474: 380–384. Wang K, Li M, Hakonarson H (2010). ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res 38: e164. Wang L-S, Valladares O, Childress DM et al. (2014). NIA Genetics of Alzheimer’s Disease Data Storage Site (NIAGADS): 2014 update. Alzheimer’s & Dementia: The Journal of the Alzheimer’s Association 10: P634–P635. Welter D, MacArthur J, Morales J et al. (2014). The NHGRI GWAS catalog, a curated resource of SNP-trait associations. Nucleic Acids Res 42: D1001–D1006. Wilhelm M, Schlegl J, Hahne H et al. (2014). Mass-spectrometrybased draft of the human proteome. Nature 509: 582–587. Wu C, Orozco C, Boyer J et al. (2009). BioGPS: an extensible and customizable portal for querying and organizing gene annotation resources. Genome Biol 10: R130.

Xu B, Ionita-Laza I, Roos JL et al. (2012). De novo gene mutations highlight patterns of genetic and neural complexity in schizophrenia. Nat Genet 44: 1365–1369. Yang Y, Muzny DM, Reid JG et al. (2013). Clinical wholeexome sequencing for the diagnosis of Mendelian disorders. N Engl J Med 369: 1502–1511. Zambon AC, Gaj S, Ho I et al. (2012). GO-Elite: a flexible solution for pathway and ontology over-representation. Bioinformatics 28: 2209–2210. Zhou X, Lowdon RF, Li D et al. (2013). Exploring long-range genome interactions using the WashU epigenome browser. Nat Methods 10: 375–376. Zhu Q, Wong AK, Krishnan A et al. (2015). Targeted exploration and analysis of large cross-platform human transcriptomic compendia. Nat Meth 12: 211–214.