Chapter 10
Statistical genetic concepts in psychiatric genomics
Darina Czamara (a) and Divya Mehta (b)
(a) Department of Translational Psychiatry, Max Planck Institute of Psychiatry, Munich, Germany; (b) School of Psychology and Counselling, Faculty of Health, Institute of Health and Biomedical Innovation, Queensland University of Technology, Kelvin Grove, QLD, Australia
1 Genetic studies
Genetic studies can impact personalized medicine in different ways (Smoller, 2014): by identifying genetic variants associated with the disease, and hence pointing to new drug targets; by stratifying patients into different therapeutic subgroups; and by investigating whether genetic variants are associated with treatment response in pharmacogenetic studies. In general, different study types can be distinguished, which will be introduced in the following paragraphs.
1.1 Genetic linkage studies
In the classical framework, linkage analysis was used for genetic mapping of Mendelian traits with familial aggregation, that is, diseases that run in families. Linkage studies assess deviation from independent segregation, and therefore require pedigrees of large families with affected and nonaffected members. Linkage itself is based on the concept that genes located in close physical proximity to each other remain linked during meiosis. Usually, a large number of VNTRs (variable number of tandem repeats) spanning the whole genome are investigated. VNTRs are DNA sequences that are repeated a variable number of times. About 30,000 VNTRs are known in the human genome (Tamiya et al., 2005). Linkage analysis assesses if these markers are transmitted to affected offspring more often than would be expected by chance, and hence tests if the disease cosegregates with the chromosomal region. A measure of linkage is the LOD score, that is, the base-10 logarithm of the ratio between the likelihood of the data if the loci are linked and the likelihood if the loci are unlinked (Morton, 1955). Linkage peaks are regions with LOD scores larger than 3.03 (Lander & Kruglyak, 1995), and hence with high probability that the actual risk locus is located near this region. Usually these peaks span large regions, which can be several megabases long. Therefore, fine-mapping of the specific regions with a denser marker map needs to be performed to pinpoint the signal to a specific locus. Linkage analysis can be run in a single pedigree, or over multiple pedigrees, where LOD scores are summed up for each position. While the calculation of LOD scores is straightforward in smaller families with complete segregation information, specific algorithms have to be applied to calculate linkage in bigger and more complicated or incomplete pedigrees (Ott & Terwiliger, 1994).
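As a minimal sketch of the LOD score described above, the following Python functions compute a two-point LOD score under simplifying assumptions that are not part of the original text: all meioses are phase-known and fully informative, so the likelihood depends only on the number of recombinants. The function names and the grid search over the recombination fraction are illustrative choices, not the algorithms used by linkage software.

```python
import math


def lod_score(recombinants: int, meioses: int, theta: float) -> float:
    """Two-point LOD score: log10 of L(theta) / L(theta = 0.5).

    Assumes phase-known, fully informative meioses (an illustrative
    simplification), so L(theta) = theta^k * (1 - theta)^(n - k).
    """
    if not 0.0 <= theta <= 0.5:
        raise ValueError("theta must lie in [0, 0.5]")
    k, n = recombinants, meioses
    if theta == 0.0:
        # Under complete linkage, any observed recombinant is impossible.
        return n * math.log10(2) if k == 0 else float("-inf")
    log_l_theta = k * math.log10(theta) + (n - k) * math.log10(1 - theta)
    log_l_null = n * math.log10(0.5)  # unlinked loci: theta = 0.5
    return log_l_theta - log_l_null


def max_lod(recombinants: int, meioses: int, n_points: int = 500):
    """Grid search for the recombination fraction that maximizes the LOD score."""
    grid = [0.5 * i / n_points for i in range(n_points + 1)]
    best_theta = max(grid, key=lambda t: lod_score(recombinants, meioses, t))
    return best_theta, lod_score(recombinants, meioses, best_theta)
```

For multiple pedigrees, the per-pedigree values of `lod_score` at a given recombination fraction would simply be added, in line with the summation described above.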
Several linkage studies have been performed for complex traits: for major depressive disorder, a number of studies have been conducted (see Levinson, 2006 for review). Also for bipolar disorder (see, for example, McQueen et al., 2005) and schizophrenia (e.g., Ng et al., 2009), several linkage peaks have been identified. In general, however, linkage studies of psychiatric disorders have yielded inconclusive results. The linkage method is best suited to identify rare mutations with large effects, but the genetic basis of psychiatric disorders is rather complex (Smoller, 2014). Therefore, the field has moved forward and focused more on association and genome-wide association studies (GWASs). These will be discussed in the next section. Nevertheless, linkage studies can identify specific genetic effects in families, which can have high impact within these families. Furthermore, they might be reconsidered in future studies, likely in conjunction with other methods, given that it has been hypothesized that rare variants are involved in the etiology of complex diseases (McClellan & King, 2010).
Personalized Psychiatry. https://doi.org/10.1016/B978-0-12-813176-3.00010-9 Copyright © 2020 Elsevier Inc. All rights reserved.
1.2 Genetic association studies
While linkage refers to the relationship of loci, association refers to the relationship of alleles (Pulst, 1999). Genetic association studies investigate the relationship between the genotype or allele at a single nucleotide polymorphism (SNP) and a specific phenotype of interest. An SNP is a single-base difference at a specific site in the genome. Most often, the phenotype of interest is a dichotomous case-control status, so that affected individuals (cases) are compared with healthy control subjects. To be more precise, it is tested whether a specific allele is present more often in cases than in controls, or vice versa. This is usually performed using logistic regression analysis, which allows incorporation of possible confounding variables, such as age or gender, as covariates in the analysis. In addition to the P-value, the odds ratio (OR) and the reference allele are usually also reported. The phenotype of interest can also be a quantitative trait, such as a score on a depression scale. Using quantitative traits might yield a clearer picture than categorizing the samples into different groups, as the whole spectrum of the phenotype is analyzed, yielding better power. In this scenario, the main analysis tool is linear regression, which reports the effect size as the mean increase or decrease of the phenotype for each additional copy of the reference allele. This analysis also allows the addition of covariates. Quantitative trait analysis can also be performed only in a specific subgroup. Uhr et al. (2008), for example, focused on treatment outcome in depressive patients and found that a specific SNP in the ABCB1 gene is associated with efficacy of a special group of antidepressants. Based on their findings, if patients present with the risk genotype, their drug treatment could be adapted, for example, by increasing the dose. This was a first step toward personalized medicine.
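To make the reported odds ratio concrete, the sketch below computes an allelic OR and an approximate 95% confidence interval from a 2 × 2 table of allele counts. This is a deliberately simpler technique than the covariate-adjusted logistic regression described above; the function name, the Woolf (log-scale) standard error, and the example counts are illustrative assumptions.

```python
import math


def allelic_odds_ratio(case_alt: int, case_ref: int,
                       ctrl_alt: int, ctrl_ref: int):
    """Allelic odds ratio with a Woolf-type 95% confidence interval.

    Inputs are allele counts (two alleles per individual), not genotype
    counts. All four cells must be non-zero.
    """
    odds_ratio = (case_alt * ctrl_ref) / (case_ref * ctrl_alt)
    # Standard error of log(OR) from the four cell counts.
    se_log_or = math.sqrt(1 / case_alt + 1 / case_ref +
                          1 / ctrl_alt + 1 / ctrl_ref)
    lower = math.exp(math.log(odds_ratio) - 1.96 * se_log_or)
    upper = math.exp(math.log(odds_ratio) + 1.96 * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts: 500 cases and 500 controls (1000 alleles each).
print(allelic_odds_ratio(300, 700, 200, 800))
```

An interval that excludes 1 corresponds to a nominally significant allelic association; unlike logistic regression, this calculation cannot adjust for covariates.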
A landmark paper showed that association analysis has more power to detect susceptibility loci as compared with linkage analysis if the loci exert only a small effect on the disease (Risch & Merikangas, 1996). Subsequently, the number of genetic studies has dramatically increased in the last decades (Lander et al., 2001). When a specific chromosomal region has already been identified as a risk locus for the disease (i.e., in a linkage study), a candidate gene analysis focusing only on SNPs that are located in the specific region can be conducted. If, however, no suitable candidates are available, usually at first an association analysis on the genome-wide level is performed. This method is unbiased and currently more cost-effective as compared with designing a highly customized array only containing specific candidate SNPs.
1.3 Genome-wide association studies
Owing to the rapid development of new technologies in the past decades, SNP genotyping at the whole-genome level has become not only available, but also cheaper and quicker. Nowadays, up to a million SNP genotypes can be assessed using high-throughput methods. Thus, GWASs are an ideal tool to study complex traits in which a high number of susceptibility loci are involved. In GWASs, the association between a phenotype of interest and genome-wide SNP genotypes is assessed. Results are often represented in a Manhattan plot (see Fig. 1), where the resulting association P-values, ordered according to their position in the genome, are depicted. These plots are a graphical summary of all association findings, and the top hits, that is, the SNPs with the lowest P-values, can easily be identified and put into regional context with each other. Usually, P-values below the threshold of 5 × 10^-8 are considered genome-wide significant. This threshold corresponds to a significance level of 0.05 corrected for about 1,000,000 independent SNPs, which are assumed to be present in the human genome in Europeans (Pe’er, Yelensky, Altshuler, & Daly, 2008). In the beginning of the GWAS era, the common disease-common variant hypothesis was postulated (Reich & Lander, 2001), which posits that complex diseases are caused by common alleles at a few susceptibility loci. Indeed, susceptibility genes for several complex traits, such as type I diabetes (Wellcome Trust Case Control Consortium, 2007), type II diabetes (Sladek et al., 2007), and breast cancer (Easton et al., 2007), could be identified. For other diseases, however, this approach was less successful. Based on this observation, it is more likely that complex diseases are better described by a polygenic model. The multilocus-multiallele hypothesis (for review see Graham & Vyse, 2005) states that multiple susceptibility alleles and environmental exposures contribute to complex diseases.
Indeed, if many markers are involved, each exhibiting only a small effect size, large sample sizes of several thousand individuals are required to detect these effects. Generally, the larger the sample sizes were, the more association hits were found (Visscher, Brown, McCarthy, & Yang, 2012). As a consequence, a large number of consortia were established, which made it possible to combine many individual studies into one large overall sample. Usually, meta-analysis is conducted to combine statistical results across individual studies. The Psychiatric Genomics Consortium (Sullivan, 2010) is one of the largest consortia today, including more than 900,000 samples for several disorders, such as schizophrenia, bipolar disorder, major depression, or ADHD.
FIG. 1 A Manhattan plot depicting genome-wide association resulting P-values, ordered according to their position in the genome.
Several genome-wide significant susceptibility loci were identified for a variety of complex traits. In a meta-analysis including over 130,000 MDD cases and over 330,000 controls, for example, 44 genome-wide significant loci were found (Wray et al., 2017).
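The combination of per-study results by meta-analysis, and the comparison of the pooled result with the genome-wide significance threshold, can be sketched as follows. The sketch assumes an inverse-variance weighted fixed-effect model, which is one common choice and is not specified in the text; the function and constant names are illustrative.

```python
import math

# Bonferroni-style threshold: 0.05 divided by ~1,000,000 independent SNPs.
GENOME_WIDE_ALPHA = 0.05 / 1_000_000


def fixed_effect_meta(betas, standard_errors):
    """Inverse-variance weighted fixed-effect meta-analysis for one SNP.

    betas: per-study effect estimates (e.g., log odds ratios).
    standard_errors: matching per-study standard errors.
    Returns the pooled effect, its standard error, z-score and two-sided P.
    """
    weights = [1.0 / se ** 2 for se in standard_errors]
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    z = pooled_beta / pooled_se
    # Two-sided P-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return pooled_beta, pooled_se, z, p_value


def is_genome_wide_significant(p_value: float) -> bool:
    return p_value < GENOME_WIDE_ALPHA
```

Because the pooled standard error shrinks as studies are added, an effect that is not significant in any single study can reach genome-wide significance in the combined sample, which is the motivation for the consortia described above.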
1.4 CNV analysis
Copy number variations (CNVs) are another important source of variation in the human genome. They occur if chromosomal regions are present multiple times, with the number of copies differing between individuals (Iafrate et al., 2004; Sebat et al., 2004). Although about three times more SNPs than CNVs are present in the human genome, their relative contribution to genomic variation (as measured in nucleotides) is comparable (Malhotra & Sebat, 2012). Roughly 12% of our genome is affected by copy number change (Redon et al., 2006). While CNVs can also be analyzed individually, one major approach is to calculate the so-called burden of rare variants (i.e., the number of CNVs carried by an individual). In the Autism Genome Project, for example, Pinto et al. (2010) found that cases had a higher burden of rare CNVs than healthy controls.
1.5 Family-based association studies
While family-based studies are mainly used for linkage analysis, association analysis can also be performed in families. One of the most widely applied approaches is the transmission disequilibrium test (TDT; Spielman, McGinnis, & Ewens, 1993). This method is used in trio data, where genotypes of affected offspring and their parents are investigated. The TDT measures the overtransmission of an allele from heterozygous parents to affected offspring. If the marker is not associated with the disease, the transmission rate of a specific allele should vary around 0.5. The TDT assesses if the actual transmission rate differs significantly from this value, and hence, if an association between disease and marker is present. Several studies looking at familial association with psychiatric diseases using TDTs have been published (e.g., Curran et al., 2006; Wei & Hemmings, 2000). One advantage is that each family serves as its own control, hence the TDT is robust to population stratification that arises if allele frequencies differ among ethnicities (see also Section 2). A disadvantage of this approach is that at least one parent has to be heterozygous for the marker. Furthermore, family-based association studies generally have lower power than case-control studies (Risch & Merikangas, 1996). Several extensions of the TDT have been proposed, including FBAT (family-based association test; see, for example, Laird & Lange, 2008), which allows, for example, for the inclusion of unaffected siblings.
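The TDT described above can be sketched in a few lines. The statistic used here is the McNemar-type chi-square of Spielman et al. (1993), computed from the number of times heterozygous parents transmit or do not transmit the allele of interest to affected offspring; the function name and the example counts are illustrative.

```python
import math


def tdt(transmitted: int, untransmitted: int):
    """Transmission disequilibrium test for one allele.

    transmitted / untransmitted: counts of heterozygous parents who did or
    did not pass the allele to an affected child.
    Returns the transmission rate, the chi-square statistic (1 df) and P.
    """
    total = transmitted + untransmitted
    if total == 0:
        raise ValueError("at least one heterozygous parent is required")
    rate = transmitted / total  # expected to be 0.5 without association
    chi2 = (transmitted - untransmitted) ** 2 / total
    # Upper tail of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return rate, chi2, p_value


# Hypothetical example: 60 transmissions versus 40 non-transmissions.
print(tdt(60, 40))
```

Homozygous parents do not contribute to either count, which is why at least one heterozygous parent per trio is needed.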
1.6 Twin studies
Monozygotic (MZ) twins are genetically identical. Therefore, any differences between MZ twins hint at environmental differences (Plomin, DeFries, McClearn, & Rutter, 1997). Dizygotic (DZ) twins share, on average, 50% of the genome. Hence, any differences between DZ twins might be due to genetics and/or environment. Studying twins is based on the comparison of correlation within MZ twins and DZ twins. If the correlation in MZ twins is higher as compared with DZ twins, a certain genetic component is implied.
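One classical way to turn the comparison of MZ and DZ correlations into a number is Falconer's formula. It is not mentioned in the text above and is shown here only as a hedged illustration; it assumes additive genetic effects, equal shared environments for MZ and DZ pairs, and no assortative mating. The function name and the example correlations are hypothetical.

```python
def falconer_estimates(r_mz: float, r_dz: float):
    """Rough variance decomposition from twin correlations.

    h2: additive genetic share, c2: shared environment,
    e2: non-shared environment (including measurement error).
    Estimates can fall outside [0, 1] when the assumptions are violated.
    """
    h2 = 2 * (r_mz - r_dz)
    c2 = 2 * r_dz - r_mz
    e2 = 1 - r_mz
    return h2, c2, e2


# Hypothetical twin correlations for a quantitative trait.
print(falconer_estimates(0.8, 0.5))
```

Modern twin analyses usually fit structural equation models instead, so this calculation should be read as a first approximation only.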
2 Genetic architecture
2.1 Population stratification
Population stratification arises because allele frequencies differ between ethnicities. Combining different ethnicities in an association study without any correction might give false-positive results. In the most extreme scenario, cases have a different ethnic background than controls. In this situation, we cannot disentangle whether significant associations are due to differences in allele frequencies between cases and controls, or between ethnicities in general. However, in most studies, there is a certain ethnic diversity in cases as well as in controls. Different methods can be used to correct for population stratification in this scenario. Devlin and Roeder (1999) proposed the genomic control procedure. They define the genomic inflation factor λ as the median of the observed association test statistics divided by the theoretical median under the null hypothesis of no association. Resulting values larger than 1 are indicators of population stratification or other confounders such as cryptic relatedness (Price, Zaitlen, Reich, & Patterson, 2010). To correct for population stratification, the test statistics can be divided by λ, and P-values can then be derived from the corrected statistics. One disadvantage of this approach, however, is that, as all test statistics are corrected with the same value, a constant inflation across the whole genome is assumed, which is not necessarily true. A different approach performs principal-component analysis or multidimensional scaling on the genome-wide SNPs. In this way, the main axes of variation can be identified and used as covariates in subsequent analyses. However, these axes do not always reflect population heterogeneity; they could also reflect long-range LD (Tian et al., 2008) or familial relatedness (Patterson, Price, & Reich, 2006), for example.
Therefore, linear mixed models, which can model stratification as well as cryptic relatedness and family structure, have been proposed (Price et al., 2010).
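The genomic control procedure can be sketched as follows, assuming chi-square association statistics with one degree of freedom. The function names are illustrative, and leaving the statistics unchanged when λ is below 1 is a common convention that is assumed here rather than taken from the text.

```python
import math
import statistics

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def genomic_inflation(chi2_stats):
    """Genomic inflation factor: observed median / expected median."""
    return statistics.median(chi2_stats) / CHI2_1DF_MEDIAN


def genomic_control(chi2_stats):
    """Divide all statistics by lambda and recompute P-values.

    Assumption: no correction is applied when lambda < 1.
    """
    lam = max(1.0, genomic_inflation(chi2_stats))
    corrected = [x / lam for x in chi2_stats]
    # Upper-tail P-value of a 1-df chi-square statistic.
    p_values = [math.erfc(math.sqrt(x / 2)) for x in corrected]
    return lam, corrected, p_values
```

Since every statistic is divided by the same λ, the ranking of SNPs is unchanged; only the P-values become larger, which reflects the constant-inflation assumption noted above.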
2.2 Imputation of missing genotypes
High-throughput technology has made large-scale whole-genome genotyping feasible and cost-effective; however, not all SNPs are covered on genotyping arrays. We can make use of the fact that the human genome is structured in linkage disequilibrium (LD) blocks, which means that not all SNPs are independent of each other. Nowadays, large sequenced reference samples from different ethnicities are available (Genomes Project Consortium et al., 2010; International HapMap Consortium, 2003). We can assess the LD pattern in these samples and use it to impute missing genotypes in our own samples (Marchini, Howie, Myers, McVean, & Donnelly, 2007). Imputation is a standard method, and it has been shown that the prediction is quite accurate (Howie, Donnelly, & Marchini, 2009).
2.3 Heritability, genetic correlation, and polygenic risk scores
Genome-wide genotype data have enabled direct estimation of additive heritability attributable to common genetic variation (“SNP heritability” or h^2_SNP) (Lee, Wray, Goddard, & Visscher, 2011; Yang et al., 2010), and aggregation of these variants into a single empirical polygenic risk score (Ruderfer et al., 2014). The polygenic risk score method takes statistically independent genetic variants from a GWAS (discovery or training set), ranks the results by their association P-value, and applies the weights from these to an independent target sample (Wray et al., 2014). The statistical power for such analyses depends mainly on the training set, so that a well-powered discovery GWAS allows testing in smaller target samples.
The polygenic risk score is defined as $\mathrm{PRS} = \sum_i x_i \log(\mathrm{OR}_i)$, where $\mathrm{OR}_i$ is the allelic odds ratio of SNP $i$ in the discovery dataset and $x_i$ is the number of risk alleles present at that locus in each individual (Iyegbe, Campbell, Butler, Ajnakina, & Sham, 2014). The allelic log(odds ratio) for the association of SNP $i$ with the trait is multiplied by the number of risk alleles. This process is repeated for each SNP within a specified P-value range. The aggregate count of weighted risk alleles creates a polygenic score per individual. The distribution of score values is normalized by fitting to a standard normal distribution curve, which helps with the interpretation of the score in downstream analyses. The polygenic method has been widely used in both within-disorder and cross-disorder analyses of psychiatric disorders (Cross-Disorder Group of the Psychiatric Genomics Consortium, 2013; International Schizophrenia Consortium et al., 2009). Because a large number of GWAS effects are null, it has become common practice to present the change in the association statistic at different inclusion thresholds, as provided by programs such as PRSice (Euesden, Lewis, & O’Reilly, 2015) and PLINK (Chang et al., 2015; Purcell et al., 2007). Another method called genome-wide complex trait analysis (GCTA) utilizes the genetic variant information from GWAS to assess the degree of genetic relatedness between individuals, assuming that cases are genetically more similar to each other than to controls (Yang, Lee, Goddard, & Visscher, 2011). The genetic relationship matrix is then correlated with dichotomous or quantitative phenotypes. Most of the preceding methods are based on genetic restricted maximum likelihood analysis, and are implemented in software packages such as GCTA and LDAK (Lee et al., 2011; Speed, Hemani, Johnson, & Balding, 2012; Yang et al., 2010, 2011).
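The score calculation described above can be sketched as follows. This is a minimal illustration, not the implementation used by PRSice or PLINK: the function names are hypothetical, the SNPs are assumed to be already independent (LD-pruned or clumped), and the standardization uses the mean and standard deviation of the target sample.

```python
import math
import statistics


def polygenic_score(risk_allele_counts, odds_ratios, p_values, p_threshold):
    """Sum of risk-allele counts weighted by log odds ratios.

    Only SNPs whose discovery P-value is at or below p_threshold contribute.
    risk_allele_counts: 0, 1 or 2 per SNP for one individual.
    """
    return sum(
        count * math.log(odds_ratio)
        for count, odds_ratio, p in zip(risk_allele_counts, odds_ratios, p_values)
        if p <= p_threshold
    )


def standardize(scores):
    """Rescale scores to mean 0 and standard deviation 1 across individuals."""
    mean = statistics.fmean(scores)
    sd = statistics.pstdev(scores)
    return [(score - mean) / sd for score in scores]
```

Calling `polygenic_score` repeatedly with different values of `p_threshold` gives the series of scores at different inclusion thresholds mentioned above.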
An extension of these methods estimates the genetic correlation (rg_SNP) explained by GWAS SNPs between two disorders (Lee, Yang, Goddard, Visscher, & Wray, 2012), with a positive correlation indicating that the cases of one disorder show higher genetic similarity to the cases of the other disorder than to their own controls. When the covariance term between the traits is similar in magnitude to the variance terms, a high correlation (rg_SNP) is obtained. This method reports SNP-based coheritability across pairs of disorders. A disadvantage of these methods is that they are computationally intensive and require individual-level genotype data. Cross-trait LD score regression allows us to estimate genetic correlation using only summary statistics from GWAS (Bulik-Sullivan, Loh, et al., 2015). This computationally fast method does not require individual genotypes, genome-wide significant SNPs, or LD-pruning, that is, it does not require a set of independent genetic variants. The approach harnesses LD information under the assumption that if a trait is genetically influenced, then variants in high LD with many other variants are more likely to tag causal variants, and therefore have higher test statistics than variants with low LD. This method is flexible and can be adapted to estimate SNP heritability (Bulik-Sullivan, Loh, et al., 2015), partition SNP heritability by functional categories (Finucane et al., 2015), and estimate genetic correlation between different complex traits (Bulik-Sullivan, Finucane, et al., 2015). Tools for LD score regression include the publicly available Python code (https://github.com/bulik/ldsc/wiki/Heritability-and-Genetic-Correlation) and LD Hub, a web interface and centralized database of summary-level GWAS results for 173 diseases/traits (http://ldsc.broadinstitute.org/). LD Hub allows for calculation of heritability and genetic correlation via LD score regression analysis of user GWAS data against these traits (Zheng et al., 2017).
These methods will pave the way to gain more insights into the genetic architecture of psychiatric disorders.
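The idea behind single-trait LD score regression can be illustrated with a heavily simplified sketch. It assumes the published model in which the expected chi-square statistic of SNP j is N·h²·ℓ_j/M + N·a + 1, with sample size N, number of SNPs M, LD score ℓ_j, and confounding term a. The real ldsc software uses weighted regression and a block jackknife; the unweighted least-squares fit and the function names below are illustrative assumptions only.

```python
def ordinary_least_squares(x, y):
    """Slope and intercept of a simple linear regression of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def ld_score_regression(ld_scores, chi2_stats, n_samples, n_snps):
    """Regress chi-square statistics on LD scores.

    Returns (SNP heritability estimate, intercept). The slope equals
    N * h2 / M; an intercept above 1 points to confounding such as
    population stratification rather than polygenicity.
    """
    slope, intercept = ordinary_least_squares(ld_scores, chi2_stats)
    h2_snp = slope * n_snps / n_samples
    return h2_snp, intercept
```

The separation of slope and intercept is what lets the method distinguish inflation caused by true polygenic signal from inflation caused by confounding, using summary statistics only.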
3 Gene-environment interactions
Gene-by-environment interaction (G × E) studies assess environmental effects on a phenotype that differ depending on the genetic background. A G × E is identified when the risk for the disorder in individuals exposed to both the risk genotype (G) and the risk environment (E) differs from the sum (additive model) or the product (multiplicative model) of the risks associated with exposure to only G or only E (Sharma, Powers, Bradley, & Ressler, 2016). Methods modeling interactions range from conventional approaches to exploratory novel methods. Analysis of a single-gene G × E is straightforward: results are presented in 2 × 2 or 3 × 2 tables of relative risks of the environmental exposure in the risk and nonrisk genotype groups (Little et al., 2002). Modeling both main effects and G × E interactions via a two-degree-of-freedom joint test is preferable to the traditional approach of main effect testing followed by testing for an interaction conditional on the main effects (Kraft, Yen, Stram, Morrison, & Gauderman, 2007). An overview of common methods used to detect G × E is presented in Table 1; these methods can also be applied to gene-gene (G × G) interactions (Chen, Liu, Zhang, & Zhang, 2007; Chen et al., 2008; Cook et al., 2004; Culverhouse et al., 2004; Hahn et al., 2003; Millstein et al., 2006; Moore & Hahn, 2002; Moore et al., 2004; Nelson et al., 2001; Park & Hastie, 2008; Ritchie et al., 2003; Strobl et al., 2009; Tomita et al., 2004; Zhang & Liu, 2007). Methods include conventional parametric approaches based on logistic regression or Cox regression models, in which multiple confounders can be adjusted for (Thomas, 2010). Haplotypes incorporating multi-SNP genetic risk can be tested in interactions with environmental exposure (haplotype × E), as an extension of the G × E analysis.
While most G × E studies have been conducted on hypothesis-driven candidate genes, the focus has recently shifted to using the polygenic risk score for a disorder as the G in the G × E (Arloth et al., 2015; Mullins et al., 2016; Peyrot et al., 2014).
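The difference between the additive and the multiplicative definition of interaction can be illustrated with the relative risks from a 2 × 2 table. The sketch below uses two standard epidemiological summaries that are not named in the text: the ratio of the joint relative risk to the product of the single-exposure relative risks, and the relative excess risk due to interaction (RERI). The function name and the example values are hypothetical.

```python
def interaction_contrasts(rr_g_only: float, rr_e_only: float, rr_both: float):
    """Interaction on the multiplicative and additive scale.

    All relative risks are expressed against the unexposed, nonrisk
    genotype group (relative risk 1).
    multiplicative_ratio == 1 means no multiplicative interaction;
    reri == 0 means no additive interaction.
    """
    multiplicative_ratio = rr_both / (rr_g_only * rr_e_only)
    reri = rr_both - rr_g_only - rr_e_only + 1
    return multiplicative_ratio, reri


# Hypothetical relative risks: G only 1.5, E only 2.0, both 3.0.
print(interaction_contrasts(1.5, 2.0, 3.0))
```

In this hypothetical example the joint risk equals the product of the single risks, so there is no interaction on the multiplicative scale, yet it exceeds the additive expectation. Whether a G × E is reported therefore depends on the scale chosen.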
TABLE 1 An overview of methods used to detect G × E.
Method type | Name | Reference
Tree-based methods | Multivariate adaptive regression splines (MARS) | Cook, Zee, and Ridker (2004)
Tree-based methods | Random forests | Lunetta, Hayward, Segal, and Van Eerdewegh (2004)
Tree-based methods | Classification and regression trees (CART) | Strobl, Malley, and Tutz (2009)
Data reduction | Multifactor dimensionality reduction (MDR) | Hahn, Ritchie, and Moore (2003)
Data reduction | Focused interaction testing framework | Millstein, Conti, Gilliland, and Gauderman (2006)
Data reduction | Combinatorial partitioning method | Nelson, Kardia, Ferrell, and Sing (2001)
Data reduction | Restricted partitioning method | Culverhouse, Klein, and Shannon (2004)
Pattern recognition and data mining | Support vector machines (SVMs) | Chen et al. (2008)
Pattern recognition and data mining | Penalized regression | Park and Hastie (2008)
Pattern recognition and data mining | Bayesian methods | Zhang and Liu (2007)
Pattern recognition and data mining | Parameter decreasing method (PDM) | Tomita et al. (2004)
Pattern recognition and data mining | Genetic programming optimized neural network (GPNN) | Ritchie, White, Parker, Hahn, and Moore (2003)
Pattern recognition and data mining | Genetic algorithm strategies | Moore, Hahn, Ritchie, Thornton, and White (2004)
Pattern recognition and data mining | Cellular automata (CA) approach | Moore and Hahn (2002)
Resampling-based tests derived from Bayesian models (Wakefield, Haneuse, Dobra, & Teeple, 2011; Yu et al., 2012) allow testing of complicated gene-environment interactions, where genetic variants in multiple loci within a region interact with the environmental risk factor(s). Tree-based methods such as classification and regression trees (Breiman, Friedman, Olshen, & Stone, 1984) and random forests (Lunetta et al., 2004) can also deal with many parameters, but cannot account for available prior information, and hence have limited value. Unbiased approaches include genome-wide gene by environment interaction studies (GWEIs) (Dunn et al., 2016), although GWEIs require larger samples and have several statistical complications (Almli et al., 2014). Given the “multidimensionality curse” of assessing high-dimensional interactions, dimension-reduction methods have recently been favored. Multidimensional interactions can be reliably explored via tools such as Multifactor Dimensionality Reduction (Edwards, Lewis, Velez, Dudek, & Ritchie, 2009), which scans across the multiway contingency table to derive the optimal classifier of disease risk based on multiple training sets, and tests its predictions on the remaining data. Similarly, the Focused Interaction Testing Framework (Millstein et al., 2006) builds through sequences of main effects and higher order interactions. Such methods allow more high-dimensional modeling for diseases involving multiple genes, multiple environmental risk factors, and epistatic (G × G) interactions. G × E studies face many methodological challenges (Duncan & Keller, 2011; Karg & Sen, 2012; Mehta & Binder, 2012). As the number of factors per model increases, there is an exponential increase in the computational time and the number of hypotheses tested. Failure to control for all covariate interactions, including gene-by-covariate and environment-by-covariate interaction terms, can result in spurious interactions (Keller, 2014).
Hybrid methods avoid the potential gene-environment correlation bias (Dai et al., 2012) while combining the estimates derived from the case-control and case-only designs (Mukherjee & Chatterjee, 2008). Finally, the sample size required to detect G × E associations is very large, with an interaction requiring at least a fourfold larger sample size than a main effect of comparable magnitude (Smith & Day, 1984) to detect even moderate effect sizes (Murcray, Lewinger, Conti, Thomas, & Gauderman, 2011). Case-only studies that assume G-E independence among controls (Weinberg & Umbach, 2000) and two-stage designs that use a case-only comparison for screening, followed by a family-based comparison that does not require G-E independence (Chen, Lin, & Liu, 2009), are complementary approaches aimed at gaining power to detect interactions.
Future G × E studies in larger, deeply phenotyped, longitudinal cohorts, assessing multiple genetic and environmental risk factors across different developmental stages, are warranted.
4 Gene expression and DNA methylation analysis
Statistical analysis of gene expression and DNA methylation can be performed at the level of an individual gene or locus, or at the genome-wide level. Candidate gene analysis is generally performed using quantitative real-time PCR (qPCR) for gene expression, and pyrosequencing or Sequenom EpiTYPER for DNA methylation. The data output can be analyzed using simple analysis tools such as Excel and SPSS. Complex high-throughput genome-wide analysis is performed via microarrays (e.g., Illumina, Agilent, and Affymetrix), or deep sequencing using next-generation applications such as RNA-seq or Bisulfite-seq. The open-source, flexible R and Bioconductor environments provide free statistical packages with prebuilt functions, workflows, and documentation that can be customized (Huber et al., 2015). The main steps for analyzing DNA methylation and gene expression microarray data are similar (Fig. 2), and have been previously described (Dedeurwaerder et al., 2014; Michels et al., 2013; Wright et al., 2016). The basic steps for analysis of RNA sequencing data are listed below; similar approaches are available for analysis of Bisulfite sequencing data (Akman, Haaf, Gravina, Vijg, & Tresch, 2014; Rackham et al., 2017; Wreczycka et al., 2017):
(a) Assembly: Aligning the sequencing reads to a reference genome and analyzing the number of reads that map to specific regions/windows to determine the abundance levels of genes, transcripts, enhancers, or other sequence intervals. Common programs include BWA (Li & Durbin, 2009), Bowtie (Langmead, Trapnell, Pop, & Salzberg, 2009), and SOAP (Li, Li, Kristiansen, & Wang, 2008). Bioconductor packages include Rsamtools and GenomicAlignments.
(b) Filtering: Regions with low coverage are removed using a cutoff to ensure good read depth for downstream analysis.
(c) Normalization: Reads per kilobase per million mapped reads (RPKM) is a simple read coverage adjustment in which gene counts are standardized by the gene length and the total number of reads in each library to give expression values (Mortazavi, Williams, McCue, Schaeffer, & Wold, 2008). Similarly, FPKM, or fragments per kilobase per million mapped fragments (Trapnell et al., 2010), applies the same adjustment to fragments rather than individual reads.
(d) Differential expression/methylation analysis: Differential gene expression or DNA methylation analysis is performed via packages such as edgeR (Nikolayeva & Robinson, 2014; Robinson, McCarthy, & Smyth, 2010), limma (Diboun, Wernisch, Orengo, & Koltzenburg, 2006; Ritchie et al., 2015), and DESeq2 (Love, Huber, & Anders, 2014; Varet, Brillet-Gueguen, Coppee, & Dillies, 2016). These methods have been reviewed in detail (Law, Alhamdoosh, Su, Smyth, & Ritchie, 2016).
Integrated pipelines are also available that perform all the data analysis steps for RNA-seq (Gao et al., 2015; Lim, Lee, & Kim, 2017; Torres-Garcia et al., 2014) and Bisulfite-seq (Jiang et al., 2014; LoVerso & Cui, 2015) data.
FIG. 2 Workflow outlining the main steps for analyzing DNA methylation and gene expression microarray data.
Technical constraints in data analysis exist because many samples are assessed and are generally processed in batches, which might cause systematic errors due to nonbiological effects. Several methods correct for these (Chen et al., 2011; Kupfer et al., 2012), including ComBat, which is an empirical Bayes method that adjusts for batch effects (Johnson, Li, & Rabinovic, 2007), and surrogate variable analysis (SVA) (Leek & Storey, 2007), which uses a linear model analysis to estimate surrogate variables from a residual expression matrix from which the biological variation of interest has already been removed. ComBat corrects for known batch effects, while SVA corrects for known and unknown confounds, which makes the method more broadly applicable, but also highly conservative. Other technical artifacts, including skewness and heteroscedasticity, can be reduced by log transformation of the data.
Selected scientific journals have made it mandatory for authors to deposit their gene expression and DNA methylation data in repositories such as Gene Expression Omnibus or ArrayExpress. Moreover, larger international consortia such as the Psychiatric Genomics Consortium also have gene expression and epigenomics working groups, which allow integration of these data within and across different cohorts.
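The filtering, normalization, and log transformation steps described in this section can be sketched as follows. The RPKM formula follows its definition (reads per kilobase of gene length per million mapped reads); the read-count cutoff of 10 and the pseudocount of 1 in the log transformation are illustrative assumptions, as are the function names.

```python
import math


def rpkm(read_count: int, gene_length_bp: int, total_mapped_reads: int) -> float:
    """Reads per kilobase of gene length per million mapped reads."""
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


def normalize_library(counts, lengths_bp, min_reads: int = 10):
    """Filter low-coverage genes, then return log2(RPKM + 1) per gene.

    counts / lengths_bp: dictionaries keyed by gene name for one library.
    min_reads is a hypothetical cutoff; the pseudocount avoids log2(0).
    """
    total = sum(counts.values())  # library size before filtering
    return {
        gene: math.log2(rpkm(n, lengths_bp[gene], total) + 1)
        for gene, n in counts.items()
        if n >= min_reads
    }
```

Dividing by gene length makes genes of different sizes comparable within a sample, while dividing by the library size makes samples with different sequencing depths comparable. Dedicated packages such as edgeR and DESeq2 use their own normalization schemes for differential analysis.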
5 Gene enrichment and network analysis
High-throughput technologies generate large numbers of genes with differences in genetic sequence or gene activity between groups at a genome-wide scale. The list of genes obtained is often difficult to interpret, hence pathway and network-based methods are used to aggregate these genetic variants and/or genes based on shared function and biology. Computational methods have been developed to predict and prioritize genetic variants and genes based on functional annotations, and these methods have been reviewed in detail (Kao, Leung, Chan, Yip, & Yap, 2017). At the genetic sequence level, gene set analysis allows us to prioritize loci where multiple SNPs show evidence of association with a trait. Gene set analysis methods bypass the stringent multiple-testing corrections needed for analysis of individual signals, and can account for gene size, SNP density, and LD information across the genome, thereby increasing the power of the study. Pathway-based genome-wide association analysis tools include INRICH (INterval enRICHment analysis), which tests for enriched association signals of predefined gene sets across independent genomic intervals (Lee, O’Dushlaine, Thomas, & Purcell, 2012), and MAGENTA (Segre et al., 2010), a computational tool that tests for enrichment of genetic associations in predefined biological processes or sets of functionally related genes. Outputs are gene set enrichment analysis P-values and a false discovery rate that accounts for the number of gene sets or pathways tested. For biological measures of gene activity, such as gene expression and DNA methylation, public databases with predefined gene sets can be used for pathway analyses and functional annotation. These include the Pathway Commons database (Cerami et al., 2011), the Kyoto Encyclopedia of Genes and Genomes (Kanehisa & Goto, 2000), Gene Ontology (Ashburner et al., 2000), DAVID (da Huang, Sherman, & Lempicki, 2009; Huang et al., 2007), and the Molecular Signatures Database (Liberzon et al., 2011).
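A simple form of gene set enrichment testing can be sketched with a hypergeometric over-representation test followed by a Benjamini-Hochberg false discovery rate adjustment. This is a generic illustration and not the algorithm implemented in INRICH or MAGENTA, which additionally account for gene size, SNP density, and LD; all function names are hypothetical.

```python
import math


def enrichment_p_value(n_background: int, n_in_set: int,
                       n_hits: int, n_overlap: int) -> float:
    """P(overlap >= n_overlap) under the hypergeometric distribution.

    n_background: all genes tested; n_in_set: genes in the pathway;
    n_hits: genes in the result list; n_overlap: hits that are in the pathway.
    """
    denominator = math.comb(n_background, n_hits)
    upper = min(n_in_set, n_hits)
    return sum(
        math.comb(n_in_set, i) * math.comb(n_background - n_in_set, n_hits - i)
        for i in range(n_overlap, upper + 1)
    ) / denominator


def benjamini_hochberg(p_values):
    """FDR-adjusted P-values, returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest P-value to enforce monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

One P-value is computed per gene set, and the adjustment is then applied across all gene sets tested, which corresponds to the false discovery rate reported by the tools above.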
Protein-protein interaction (PPI) networks can also be used to define gene sets: genes that interact at the protein level are selected, and their direct PPI neighbors and the interactions between these proteins are plotted to visualize large protein networks. Software for building PPI networks includes STRING (Szklarczyk et al., 2015) and PINA (Cowley et al., 2012). Weighted gene coexpression network analysis (WGCNA) focuses on exploring correlations between probes in gene expression data and relating the resulting gene networks to clinical data (Langfelder & Horvath, 2008). WGCNA reflects the notion that genes within the same biological pathway will have similar patterns of expression, and has been successfully used to unravel orchestrated networks of genes in psychiatric disorders (Breen et al., 2017; de Jong et al., 2016; Kim, Hwang, Webster, & Lee, 2016). First, genes are clustered by average linkage hierarchical clustering applied to a dissimilarity measure. Next, gene coexpression modules are detected by applying a branch cutting method. Finally, module eigengenes, which summarize the expression profile of each coexpression module, can be integrated with external information, such as phenotypes and covariates, to identify networks associated with specific traits. Pathway-based methods have several shortcomings, including intrinsic redundancy, whereby gene sets might overlap only partially, but are tagged by the same biological processes at the annotation level. Furthermore, genes within a defined gene set do not always have coherent biological activity, hence there is heterogeneity within a gene set that might not be reflected when these genes are combined within the curated sets. Network-based methods also allow a systems approach by integrating results across different types of data, in conjunction with clinical and biological information, to identify genes associated with psychiatric disorders. This in turn will allow elucidation of the coordinated activity of sets of biologically annotated genes at a broader network level, and offer deeper insight into the disease mechanisms and biology of psychiatric disorders.
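The three coexpression steps described above (clustering on a dissimilarity measure, cutting the tree into modules, and relating a per-module summary profile to a trait) can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the WGCNA implementation: it uses 1 − |r|^β as the dissimilarity and a fixed cut height, whereas the WGCNA R package typically works with a topological-overlap-based dissimilarity and a dynamic branch cutting routine. The expression values, the trait, the power β = 6, the cut height, and all function names are hypothetical. The per-module summary is the first principal component across samples (the "module eigengene" in WGCNA terminology), obtained here by power iteration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def dissimilarity(expr, beta=6):
    """Pairwise 1 - |r|^beta; beta is an assumed soft-threshold power."""
    return {(g, h): 1 - abs(pearson(expr[g], expr[h])) ** beta
            for g in expr for h in expr if g != h}


def detect_modules(genes, dist, cut_height=0.5):
    """Average linkage agglomerative clustering, stopped at a fixed height."""
    clusters = [[g] for g in genes]

    def avg(a, b):
        return sum(dist[(g, h)] for g in a for h in b) / (len(a) * len(b))

    while len(clusters) > 1:
        d, i, j = min((avg(clusters[i], clusters[j]), i, j)
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters)))
        if d > cut_height:  # remaining clusters are too dissimilar to merge
            break
        merged = clusters.pop(j)
        clusters[i] = clusters[i] + merged
    return clusters


def module_eigengene(expr, module, iterations=100):
    """First principal component across samples, via power iteration."""
    rows = []
    for g in module:  # standardize each gene to mean 0 and unit variance
        x = expr[g]
        m = sum(x) / len(x)
        sd = sqrt(sum((v - m) ** 2 for v in x) / (len(x) - 1))
        rows.append([(v - m) / sd for v in x])
    n = len(rows[0])
    v = rows[0][:]  # start from the first gene's standardized profile
    for _ in range(iterations):
        loadings = [sum(r[k] * v[k] for k in range(n)) for r in rows]
        v = [sum(w * r[k] for w, r in zip(loadings, rows)) for k in range(n)]
        norm = sqrt(sum(a * a for a in v))
        v = [a / norm for a in v]
    # Orient the eigengene so that it follows the module's average expression.
    mean_profile = [sum(r[k] for r in rows) / len(rows) for k in range(n)]
    if sum(a * b for a, b in zip(v, mean_profile)) < 0:
        v = [-a for a in v]
    return v


# Hypothetical expression data: five genes measured in six samples.
expr = {
    "g1": [1.0, 2.1, 2.9, 4.2, 5.0, 6.1],
    "g2": [2.0, 4.1, 6.2, 7.9, 10.1, 12.0],
    "g3": [0.9, 2.0, 3.2, 3.9, 5.1, 5.8],
    "g4": [5.0, 1.1, 4.2, 1.9, 6.0, 3.1],
    "g5": [10.1, 2.0, 8.1, 4.2, 11.9, 6.0],
}
trait = [0, 0, 0, 1, 1, 1]  # hypothetical case/control status per sample

modules = detect_modules(sorted(expr), dissimilarity(expr))
module_trait = {tuple(sorted(m)): pearson(module_eigengene(expr, m), trait)
                for m in modules}
for genes, r in module_trait.items():
    print(genes, f"module-trait correlation = {r:.2f}")
```

In this toy data set, g1 to g3 form one module whose eigengene tracks the trait, while g4 and g5 form a second module that does not. Real analyses involve thousands of genes, covariate adjustment, and multiple-testing correction across modules.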
6 Conclusions and future directions
Psychiatry has entered the era of “Big Data,” in which vast amounts of clinical, genomic, transcriptomic, brain imaging, endocrine, environmental, and other types of data are routinely assessed. Analyzing these data and integrating them across different strata is a challenging task; the establishment and implementation of appropriate statistical methods are therefore key to the interpretation of the data.
In this chapter, we have introduced and described some of the current statistical methods that are widely used in genomic studies of complex traits. Some of these methods can be further tuned to psychiatric disorders by combining environmental risk factors with genetic risk factors, thereby providing an avenue for the identification of individual biological and genetic risk markers for disease. The “one-size-fits-all” notion has been shown to be ineffective for psychiatric disorders, as evidenced by high rates of relapse, treatment-resistant disorders, and adverse reactions to psychiatric drugs; hence it is clear that psychiatry will benefit enormously from tailored treatments. The statistical methods outlined in this chapter will allow us to integrate knowledge from different types of biological and clinical information, and this combined information will facilitate improved diagnostic classification and personalized psychiatric interventions. Future longitudinal studies that collect and analyze large, well-characterized patient data sets and combine clinical data with observed biological measures will help uncover disease processes and trajectories. By understanding how a disease progresses, predictive statistical models can be built to identify high-risk individuals and monitor them closely, subsequently allowing for timely intervention. In conclusion, by leveraging simple and advanced statistical methods to interrogate patient data, researchers and clinicians will be able to gain deeper insight into the etiology of psychiatric disorders, an important step toward the goal of personalized psychiatry.
References
Akman, K., Haaf, T., Gravina, S., Vijg, J., & Tresch, A. (2014). Genome-wide quantitative analysis of DNA methylation from bisulfite sequencing data. Bioinformatics, 30(13), 1933–1934. https://doi.org/10.1093/bioinformatics/btu142.
Almli, L. M., Duncan, R., Feng, H., Ghosh, D., Binder, E. B., Bradley, B., … Epstein, M. P. (2014). Correcting systematic inflation in genetic association tests that consider interaction effects: Application to a genome-wide association study of posttraumatic stress disorder. JAMA Psychiatry, 71(12), 1392–1399. https://doi.org/10.1001/jamapsychiatry.2014.1339.
Arloth, J., Bogdan, R., Weber, P., Frishman, G., Menke, A., Wagner, K. V., … Major Depressive Disorder Working Group of the Psychiatric Genomics Consortium PGC. (2015). Genetic differences in the immediate transcriptome response to stress predict risk-related brain function and psychiatric disorders. Neuron, 86(5), 1189–1202. https://doi.org/10.1016/j.neuron.2015.05.034.
Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., … Sherlock, G. (2000). Gene ontology: Tool for the unification of biology. The Gene Ontology Consortium. Nature Genetics, 25(1), 25–29. https://doi.org/10.1038/75556.
Breen, M. S., Tylee, D. S., Maihofer, A. X., Neylan, T. C., Mehta, D., Binder, E., … Glatt, S. J. (2017). PTSD blood transcriptome mega-analysis: Shared inflammatory pathways across biological sex and modes of trauma. Neuropsychopharmacology. https://doi.org/10.1038/npp.2017.220.
Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and regression trees. Belmont, CA: Wadsworth International.
Bulik-Sullivan, B., Finucane, H. K., Anttila, V., Gusev, A., Day, F. R., Loh, P. R., … Neale, B. M. (2015). An atlas of genetic correlations across human diseases and traits. Nature Genetics, 47(11), 1236–1241. https://doi.org/10.1038/ng.3406.
Bulik-Sullivan, B. K., Loh, P. R., Finucane, H. K., Ripke, S., Yang, J., Schizophrenia Working Group of the Psychiatric Genomics Consortium, … Neale, B. M. (2015). LD score regression distinguishes confounding from polygenicity in genome-wide association studies. Nature Genetics, 47(3), 291–295. https://doi.org/10.1038/ng.3211.
Cerami, E. G., Gross, B. E., Demir, E., Rodchenkov, I., Babur, O., Anwar, N., … Sander, C. (2011). Pathway Commons, a web resource for biological pathway data. Nucleic Acids Research, 39(Database issue), D685–D690. https://doi.org/10.1093/nar/gkq1039.
Chang, C. C., Chow, C. C., Tellier, L. C., Vattikuti, S., Purcell, S. M., & Lee, J. J. (2015). Second-generation PLINK: Rising to the challenge of larger and richer datasets. Gigascience, 4, 7. https://doi.org/10.1186/s13742-015-0047-8.
Chen, C., Grennan, K., Badner, J., Zhang, D., Gershon, E., Jin, L., & Liu, C. (2011). Removing batch effects in analysis of expression microarray data: An evaluation of six batch adjustment methods. PLoS One, 6(2). https://doi.org/10.1371/journal.pone.0017238.
Chen, Y. H., Lin, H. W., & Liu, H. (2009). Two-stage analysis for gene-environment interaction utilizing both case-only and family-based analysis. Genetic Epidemiology, 33(2), 95–104. https://doi.org/10.1002/gepi.20357.
Chen, X., Liu, C. T., Zhang, M., & Zhang, H. (2007). A forest-based approach to identifying gene and gene-gene interactions. Proceedings of the National Academy of Sciences of the United States of America, 104(49), 19199–19203. https://doi.org/10.1073/pnas.0709868104.
Chen, S. H., Sun, J., Dimitrov, L., Turner, A. R., Adams, T. S., Meyers, D. A., … Hsu, F. C. (2008). A support vector machine approach for detecting gene-gene interaction. Genetic Epidemiology, 32(2), 152–167. https://doi.org/10.1002/gepi.20272.
Cook, N. R., Zee, R. Y., & Ridker, P. M. (2004). Tree and spline based association analysis of gene-gene interaction models for ischemic stroke. Statistics in Medicine, 23(9), 1439–1453. https://doi.org/10.1002/sim.1749.
Cowley, M. J., Pinese, M., Kassahn, K. S., Waddell, N., Pearson, J. V., Grimmond, S. M., … Wu, J. (2012). PINA v2.0: Mining interactome modules. Nucleic Acids Research, 40(Database issue), D862–D865. https://doi.org/10.1093/nar/gkr967.
Cross-Disorder Group of the Psychiatric Genomics Consortium. (2013). Identification of risk loci with shared effects on five major psychiatric disorders: A genome-wide analysis. Lancet, 381(9875), 1371–1379. https://doi.org/10.1016/S0140-6736(12)62129-1.
Culverhouse, R., Klein, T., & Shannon, W. (2004). Detecting epistatic interactions contributing to quantitative traits. Genetic Epidemiology, 27(2), 141–152. https://doi.org/10.1002/gepi.20006.
Curran, S., Powell, J., Neale, B. M., Dworzynski, K., Li, T., Murphy, D., & Bolton, P. F. (2006). An association analysis of candidate genes on chromosome 15 q11-13 and autism spectrum disorder. Molecular Psychiatry, 11(8), 709–713. https://doi.org/10.1038/sj.mp.4001839.
da Huang, W., Sherman, B. T., & Lempicki, R. A. (2009). Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nature Protocols, 4(1), 44–57. https://doi.org/10.1038/nprot.2008.211.
Dai, J. Y., Logsdon, B. A., Huang, Y., Hsu, L., Reiner, A. P., Prentice, R. L., & Kooperberg, C. (2012). Simultaneously testing for marginal genetic association and gene-environment interaction. American Journal of Epidemiology, 176(2), 164–173. https://doi.org/10.1093/aje/kwr521.
Dedeurwaerder, S., Defrance, M., Bizet, M., Calonne, E., Bontempi, G., & Fuks, F. (2014). A comprehensive overview of Infinium HumanMethylation450 data processing. Briefings in Bioinformatics, 15(6), 929–941. https://doi.org/10.1093/bib/bbt054.
de Jong, S., Newhouse, S. J., Patel, H., Lee, S., Dempster, D., Curtis, C., … Breen, G. (2016). Immune signatures and disorder-specific patterns in a cross-disorder gene expression analysis. The British Journal of Psychiatry, 209(3), 202–208. https://doi.org/10.1192/bjp.bp.115.175471.
Devlin, B., & Roeder, K. (1999). Genomic control for association studies. Biometrics, 55(4), 997–1004.
Diboun, I., Wernisch, L., Orengo, C. A., & Koltzenburg, M. (2006). Microarray analysis after RNA amplification can detect pronounced differences in gene expression using limma. BMC Genomics, 7, 252. https://doi.org/10.1186/1471-2164-7-252.
Duncan, L. E., & Keller, M. C. (2011). A critical review of the first 10 years of candidate gene-by-environment interaction research in psychiatry. The American Journal of Psychiatry, 168(10), 1041–1049. https://doi.org/10.1176/appi.ajp.2011.11020191.
Dunn, E. C., Wiste, A., Radmanesh, F., Almli, L. M., Gogarten, S. M., Sofer, T., … Smoller, J. W. (2016). Genome-wide association study (GWAS) and genome-wide by environment interaction study (GWEIS) of depressive symptoms in African American and Hispanic/Latina women. Depression and Anxiety, 33(4), 265–280. https://doi.org/10.1002/da.22484.
Easton, D. F., Pooley, K. A., Dunning, A. M., Pharoah, P. D., Thompson, D., Ballinger, D. G., … Ponder, B. A. (2007). Genome-wide association study identifies novel breast cancer susceptibility loci. Nature, 447(7148), 1087–1093. https://doi.org/10.1038/nature05887.
Edwards, T. L., Lewis, K., Velez, D. R., Dudek, S., & Ritchie, M. D. (2009). Exploring the performance of Multifactor Dimensionality Reduction in large scale SNP studies and in the presence of genetic heterogeneity among epistatic disease models. Human Heredity, 67(3), 183–192. https://doi.org/10.1159/000181157.
Euesden, J., Lewis, C. M., & O’Reilly, P. F. (2015). PRSice: Polygenic Risk Score software. Bioinformatics, 31(9), 1466–1468. https://doi.org/10.1093/bioinformatics/btu848.
Finucane, H. K., Bulik-Sullivan, B., Gusev, A., Trynka, G., Reshef, Y., Loh, P. R., … Price, A. L. (2015). Partitioning heritability by functional annotation using genome-wide association summary statistics. Nature Genetics, 47(11), 1228–1235. https://doi.org/10.1038/ng.3404.
Gao, S., Zou, D., Mao, L., Zhou, Q., Jia, W., Huang, Y., … Sorensen, K. D. (2015). SMAP: A streamlined methylation analysis pipeline for bisulfite sequencing. Gigascience, 4, 29. https://doi.org/10.1186/s13742-015-0070-9.
1000 Genomes Project Consortium, Abecasis, G. R., Altshuler, D., Auton, A., Brooks, L. D., Durbin, R. M., … McVean, G. A. (2010). A map of human genome variation from population-scale sequencing. Nature, 467(7319), 1061–1073. https://doi.org/10.1038/nature09534.
Graham, D. S. C., & Vyse, T. J. (2005). The common disease common variant concept. In Encyclopedia of genetics, genomics, proteomics and bioinformatics. Wiley. https://doi.org/10.1002/047001153X.g105205.
Hahn, L. W., Ritchie, M. D., & Moore, J. H. (2003). Multifactor dimensionality reduction software for detecting gene-gene and gene-environment interactions. Bioinformatics, 19(3), 376–382.
Howie, B. N., Donnelly, P., & Marchini, J. (2009). A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genetics, 5(6). https://doi.org/10.1371/journal.pgen.1000529.
Huang, D. W., Sherman, B. T., Tan, Q., Kir, J., Liu, D., Bryant, D., … Lempicki, R. A. (2007). DAVID Bioinformatics Resources: Expanded annotation database and novel algorithms to better extract biology from large gene lists. Nucleic Acids Research, 35(Web Server issue), W169–W175. https://doi.org/10.1093/nar/gkm415.
Huber, W., Carey, V. J., Gentleman, R., Anders, S., Carlson, M., Carvalho, B. S., … Morgan, M. (2015). Orchestrating high-throughput genomic analysis with Bioconductor. Nature Methods, 12(2), 115–121. https://doi.org/10.1038/nmeth.3252.
Iafrate, A. J., Feuk, L., Rivera, M. N., Listewnik, M. L., Donahoe, P. K., Qi, Y., … Lee, C. (2004). Detection of large-scale variation in the human genome. Nature Genetics, 36(9), 949–951. https://doi.org/10.1038/ng1416.
International HapMap Consortium. (2003). The International HapMap Project. Nature, 426(6968), 789–796. https://doi.org/10.1038/nature02168.
International Schizophrenia Consortium, Purcell, S. M., Wray, N. R., Stone, J. L., Visscher, P. M., O’Donovan, M. C., … Sklar, P. (2009). Common polygenic variation contributes to risk of schizophrenia and bipolar disorder. Nature, 460(7256), 748–752. https://doi.org/10.1038/nature08185.
Iyegbe, C., Campbell, D., Butler, A., Ajnakina, O., & Sham, P. (2014). The emerging molecular architecture of schizophrenia, polygenic risk scores and the clinical implications for GxE research. Social Psychiatry and Psychiatric Epidemiology, 49(2), 169–182. https://doi.org/10.1007/s00127-014-0823-2.
Jiang, P., Sun, K., Lun, F. M., Guo, A. M., Wang, H., Chan, K. C., … Sun, H. (2014). Methy-Pipe: An integrated bioinformatics pipeline for whole genome bisulfite sequencing data analysis. PLoS One, 9(6). https://doi.org/10.1371/journal.pone.0100360.
Johnson, W. E., Li, C., & Rabinovic, A. (2007). Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics, 8(1), 118–127. https://doi.org/10.1093/biostatistics/kxj037.
Kanehisa, M., & Goto, S. (2000). KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Research, 28(1), 27–30.
Kao, P. Y., Leung, K. H., Chan, L. W., Yip, S. P., & Yap, M. K. (2017). Pathway analysis of complex diseases for GWAS, extending to consider rare variants, multi-omics and interactions. Biochimica et Biophysica Acta, 1861(2), 335–353. https://doi.org/10.1016/j.bbagen.2016.11.030.
Karg, K., & Sen, S. (2012). Gene x environment interaction models in psychiatric genetics. Current Topics in Behavioral Neurosciences, 12, 441–462. https://doi.org/10.1007/7854_2011_184.
Keller, M. C. (2014). Gene x environment interaction studies have not properly controlled for potential confounders: The problem and the (simple) solution. Biological Psychiatry, 75(1), 18–24. https://doi.org/10.1016/j.biopsych.2013.09.006.
Kim, S., Hwang, Y., Webster, M. J., & Lee, D. (2016). Differential activation of immune/inflammatory response-related co-expression modules in the hippocampus across the major psychiatric disorders. Molecular Psychiatry, 21(3), 376–385. https://doi.org/10.1038/mp.2015.79.
Kraft, P., Yen, Y. C., Stram, D. O., Morrison, J., & Gauderman, W. J. (2007). Exploiting gene-environment interaction to detect genetic associations. Human Heredity, 63(2), 111–119. https://doi.org/10.1159/000099183.
Kupfer, P., Guthke, R., Pohlers, D., Huber, R., Koczan, D., & Kinne, R. W. (2012). Batch correction of microarray data substantially improves the identification of genes differentially expressed in rheumatoid arthritis and osteoarthritis. BMC Medical Genomics, 5, 23. https://doi.org/10.1186/1755-8794-5-23.
Laird, N. M., & Lange, C. (2008). Family-based methods for linkage and association analysis. Advances in Genetics, 60, 219–252. https://doi.org/10.1016/S0065-2660(07)00410-5.
Lander, E., & Kruglyak, L. (1995). Genetic dissection of complex traits: Guidelines for interpreting and reporting linkage results. Nature Genetics, 11(3), 241–247. https://doi.org/10.1038/ng1195-241.
Lander, E. S., Linton, L. M., Birren, B., Nusbaum, C., Zody, M. C., Baldwin, J., … International Human Genome Sequencing Consortium. (2001). Initial sequencing and analysis of the human genome. Nature, 409(6822), 860–921. https://doi.org/10.1038/35057062.
Langfelder, P., & Horvath, S. (2008). WGCNA: An R package for weighted correlation network analysis. BMC Bioinformatics, 9, 559. https://doi.org/10.1186/1471-2105-9-559.
Langmead, B., Trapnell, C., Pop, M., & Salzberg, S. L. (2009). Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology, 10(3), R25. https://doi.org/10.1186/gb-2009-10-3-r25.
Law, C. W., Alhamdoosh, M., Su, S., Smyth, G. K., & Ritchie, M. E. (2016). RNA-seq analysis is easy as 1-2-3 with limma, Glimma and edgeR. F1000Res, 5, 1408. https://doi.org/10.12688/f1000research.9005.2.
Lee, P. H., O’Dushlaine, C., Thomas, B., & Purcell, S. M. (2012). INRICH: Interval-based enrichment analysis for genome-wide association studies. Bioinformatics, 28(13), 1797–1799. https://doi.org/10.1093/bioinformatics/bts191.
Lee, S. H., Wray, N. R., Goddard, M. E., & Visscher, P. M. (2011). Estimating missing heritability for disease from genome-wide association studies. American Journal of Human Genetics, 88(3), 294–305. https://doi.org/10.1016/j.ajhg.2011.02.002.
Lee, S. H., Yang, J., Goddard, M. E., Visscher, P. M., & Wray, N. R. (2012). Estimation of pleiotropy between complex diseases using single-nucleotide polymorphism-derived genomic relationships and restricted maximum likelihood. Bioinformatics, 28(19), 2540–2542. https://doi.org/10.1093/bioinformatics/bts474.
Leek, J. T., & Storey, J. D. (2007). Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genetics, 3(9), 1724–1735. https://doi.org/10.1371/journal.pgen.0030161.
Levinson, D. F. (2006). The genetics of depression: A review. Biological Psychiatry, 60(2), 84–92. https://doi.org/10.1016/j.biopsych.2005.08.024.
Li, H., & Durbin, R. (2009). Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 25(14), 1754–1760. https://doi.org/10.1093/bioinformatics/btp324.
Li, R., Li, Y., Kristiansen, K., & Wang, J. (2008). SOAP: Short oligonucleotide alignment program. Bioinformatics, 24(5), 713–714. https://doi.org/10.1093/bioinformatics/btn025.
Liberzon, A., Subramanian, A., Pinchback, R., Thorvaldsdottir, H., Tamayo, P., & Mesirov, J. P. (2011). Molecular Signatures Database (MSigDB) 3.0. Bioinformatics, 27(12), 1739–1740. https://doi.org/10.1093/bioinformatics/btr260.
Lim, J. H., Lee, S. Y., & Kim, J. H. (2017). TRAPR: R package for statistical analysis and visualization of RNA-Seq data. Genomics Inform, 15(1), 51–53. https://doi.org/10.5808/GI.2017.15.1.51.
Little, J., Bradley, L., Bray, M. S., Clyne, M., Dorman, J., Ellsworth, D. L., … Weinberg, C. (2002). Reporting, appraising, and integrating data on genotype prevalence and gene-disease associations. American Journal of Epidemiology, 156(4), 300–310.
Love, M. I., Huber, W., & Anders, S. (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology, 15(12), 550. https://doi.org/10.1186/s13059-014-0550-8.
LoVerso, P. R., & Cui, F. (2015). A computational pipeline for cross-species analysis of RNA-seq data using R and Bioconductor. Bioinformatics and Biology Insights, 9, 165–174. https://doi.org/10.4137/BBI.S30884.
Lunetta, K. L., Hayward, L. B., Segal, J., & Van Eerdewegh, P. (2004). Screening large-scale association study data: Exploiting interactions using random forests. BMC Genetics, 5, 32. https://doi.org/10.1186/1471-2156-5-32.
Malhotra, D., & Sebat, J. (2012). CNVs: Harbingers of a rare variant revolution in psychiatric genetics. Cell, 148(6), 1223–1241. https://doi.org/10.1016/j.cell.2012.02.039.
Marchini, J., Howie, B., Myers, S., McVean, G., & Donnelly, P. (2007). A new multipoint method for genome-wide association studies by imputation of genotypes. Nature Genetics, 39(7), 906–913. https://doi.org/10.1038/ng2088.
McClellan, J., & King, M. C. (2010). Genetic heterogeneity in human disease. Cell, 141(2), 210–217. https://doi.org/10.1016/j.cell.2010.03.032.
McQueen, M. B., Devlin, B., Faraone, S. V., Nimgaonkar, V. L., Sklar, P., Smoller, J. W., … Laird, N. M. (2005). Combined analysis from eleven linkage studies of bipolar disorder provides strong evidence of susceptibility loci on chromosomes 6q and 8q. American Journal of Human Genetics, 77(4), 582–595. https://doi.org/10.1086/491603.
Mehta, D., & Binder, E. B. (2012). Gene x environment vulnerability factors for PTSD: The HPA-axis. Neuropharmacology, 62(2), 654–662. https://doi.org/10.1016/j.neuropharm.2011.03.009.
Michels, K. B., Binder, A. M., Dedeurwaerder, S., Epstein, C. B., Greally, J. M., Gut, I., … Irizarry, R. A. (2013). Recommendations for the design and analysis of epigenome-wide association studies. Nature Methods, 10(10), 949–955. https://doi.org/10.1038/nmeth.2632.
Millstein, J., Conti, D. V., Gilliland, F. D., & Gauderman, W. J. (2006). A testing framework for identifying susceptibility genes in the presence of epistasis. American Journal of Human Genetics, 78(1), 15–27. https://doi.org/10.1086/498850.
Moore, J. H., & Hahn, L. W. (2002). A cellular automata approach to detecting interactions among single-nucleotide polymorphisms in complex multifactorial diseases. Pacific Symposium on Biocomputing, 53–64.
Moore, J. H., Hahn, L. W., Ritchie, M. D., Thornton, T. A., & White, B. C. (2004). Routine discovery of complex genetic models using genetic algorithms. Applied Soft Computing, 4(1), 79–86. https://doi.org/10.1016/j.asoc.2003.08.003.
Mortazavi, A., Williams, B. A., McCue, K., Schaeffer, L., & Wold, B. (2008). Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nature Methods, 5(7), 621–628. https://doi.org/10.1038/nmeth.1226.
Morton, N. E. (1955). Sequential tests for the detection of linkage. American Journal of Human Genetics, 7(3), 277–318.
Mukherjee, B., & Chatterjee, N. (2008). Exploiting gene-environment independence for analysis of case-control studies: An empirical Bayes-type shrinkage estimator to trade-off between bias and efficiency. Biometrics, 64(3), 685–694. https://doi.org/10.1111/j.1541-0420.2007.00953.x.
Mullins, N., Power, R. A., Fisher, H. L., Hanscombe, K. B., Euesden, J., Iniesta, R., … Lewis, C. M. (2016). Polygenic interactions with environmental adversity in the aetiology of major depressive disorder. Psychological Medicine, 46(4), 759–770. https://doi.org/10.1017/S0033291715002172.
Murcray, C. E., Lewinger, J. P., Conti, D. V., Thomas, D. C., & Gauderman, W. J. (2011). Sample size requirements to detect gene-environment interactions in genome-wide association studies. Genetic Epidemiology, 35(3), 201–210. https://doi.org/10.1002/gepi.20569.
Nelson, M. R., Kardia, S. L., Ferrell, R. E., & Sing, C. F. (2001). A combinatorial partitioning method to identify multilocus genotypic partitions that predict quantitative trait variation. Genome Research, 11(3), 458–470. https://doi.org/10.1101/gr.172901.
Ng, M. Y., Levinson, D. F., Faraone, S. V., Suarez, B. K., DeLisi, L. E., Arinami, T., … Lewis, C. M. (2009). Meta-analysis of 32 genome-wide linkage studies of schizophrenia. Molecular Psychiatry, 14(8), 774–785. https://doi.org/10.1038/mp.2008.135.
Nikolayeva, O., & Robinson, M. D. (2014). edgeR for differential RNA-seq and ChIP-seq analysis: An application to stem cell biology. Methods in Molecular Biology, 1150, 45–79. https://doi.org/10.1007/978-1-4939-0512-6_3.
Ott, J., & Terwiliger, J. D. (1994). Handbook for human genetic linkage. Baltimore, MD: Johns Hopkins University Press.
Park, M. Y., & Hastie, T. (2008). Penalized logistic regression for detecting gene interactions. Biostatistics, 9(1), 30–50. https://doi.org/10.1093/biostatistics/kxm010.
Patterson, N., Price, A. L., & Reich, D. (2006). Population structure and eigenanalysis. PLoS Genetics, 2(12). https://doi.org/10.1371/journal.pgen.0020190.
Pe’er, I., Yelensky, R., Altshuler, D., & Daly, M. J. (2008). Estimation of the multiple testing burden for genomewide association studies of nearly all common variants. Genetic Epidemiology, 32(4), 381–385. https://doi.org/10.1002/gepi.20303.
Peyrot, W. J., Milaneschi, Y., Abdellaoui, A., Sullivan, P. F., Hottenga, J. J., Boomsma, D. I., & Penninx, B. W. (2014). Effect of polygenic risk scores on depression in childhood trauma. The British Journal of Psychiatry, 205(2), 113–119. https://doi.org/10.1192/bjp.bp.113.143081.
Pinto, D., Pagnamenta, A. T., Klei, L., Anney, R., Merico, D., Regan, R., … Betancur, C. (2010). Functional impact of global rare copy number variation in autism spectrum disorders. Nature, 466(7304), 368–372. https://doi.org/10.1038/nature09146.
Plomin, R., DeFries, J. C., McClearn, G., & Rutter, M. (1997). Behavioural genetics. W.H. Freeman.
Price, A. L., Zaitlen, N. A., Reich, D., & Patterson, N. (2010). New approaches to population stratification in genome-wide association studies. Nature Reviews. Genetics, 11(7), 459–463. https://doi.org/10.1038/nrg2813.
Pulst, S. M. (1999). Genetic linkage analysis. Archives of Neurology, 56(6), 667–672.
Purcell, S., Neale, B., Todd-Brown, K., Thomas, L., Ferreira, M. A., Bender, D., … Sham, P. C. (2007). PLINK: A tool set for whole-genome association and population-based linkage analyses. American Journal of Human Genetics, 81(3), 559–575. https://doi.org/10.1086/519795.
Rackham, O. J., Langley, S. R., Oates, T., Vradi, E., Harmston, N., Srivastava, P. K., … Petretto, E. (2017). A Bayesian approach for analysis of whole-genome bisulfite sequencing data identifies disease-associated changes in DNA methylation. Genetics, 205(4), 1443–1458. https://doi.org/10.1534/genetics.116.195008.
Redon, R., Ishikawa, S., Fitch, K. R., Feuk, L., Perry, G. H., Andrews, T. D., … Hurles, M. E. (2006). Global variation in copy number in the human genome. Nature, 444(7118), 444–454. https://doi.org/10.1038/nature05329.
Reich, D. E., & Lander, E. S. (2001). On the allelic spectrum of human disease. Trends in Genetics, 17(9), 502–510.
Risch, N., & Merikangas, K. (1996). The future of genetic studies of complex human diseases. Science, 273(5281), 1516–1517.
Ritchie, M. E., Phipson, B., Wu, D., Hu, Y., Law, C. W., Shi, W., & Smyth, G. K. (2015). Limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research, 43(7), e47. https://doi.org/10.1093/nar/gkv007.
Ritchie, M. D., White, B. C., Parker, J. S., Hahn, L. W., & Moore, J. H. (2003). Optimization of neural network architecture using genetic programming improves detection and modeling of gene-gene interactions in studies of human diseases. BMC Bioinformatics, 4, 28. https://doi.org/10.1186/1471-2105-4-28.
Robinson, M. D., McCarthy, D. J., & Smyth, G. K. (2010). edgeR: A Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1), 139–140. https://doi.org/10.1093/bioinformatics/btp616.
Ruderfer, D. M., Fanous, A. H., Ripke, S., McQuillin, A., Amdur, R. L., Schizophrenia Working Group of the Psychiatric Genomics Consortium, … Kendler, K. S. (2014). Polygenic dissection of diagnosis and clinical dimensions of bipolar disorder and schizophrenia. Molecular Psychiatry, 19(9), 1017–1024. https://doi.org/10.1038/mp.2013.138.
Sebat, J., Lakshmi, B., Troge, J., Alexander, J., Young, J., Lundin, P., … Wigler, M. (2004). Large-scale copy number polymorphism in the human genome. Science, 305(5683), 525–528. https://doi.org/10.1126/science.1098918.
Segre, A. V., DIAGRAM Consortium, MAGIC Investigators, Groop, L., Mootha, V. K., Daly, M. J., & Altshuler, D. (2010). Common inherited variation in mitochondrial genes is not enriched for associations with type 2 diabetes or related glycemic traits. PLoS Genetics, 6(8). https://doi.org/10.1371/ journal.pgen.1001058. Sharma, S., Powers, A., Bradley, B., & Ressler, K. J. (2016). Gene x environment determinants of stress- and anxiety-related disorders. Annual Review of Psychology, 67, 239–261. https://doi.org/10.1146/annurev-psych-122414-033408. Sladek, R., Rocheleau, G., Rung, J., Dina, C., Shen, L., Serre, D., … Froguel, P. (2007). A genome-wide association study identifies novel risk loci for type 2 diabetes. Nature, 445(7130), 881–885. https://doi.org/10.1038/nature05616. Smith, P. G., & Day, N. E. (1984). The design of case-control studies: The influence of confounding and interaction effects. International Journal of Epidemiology, 13(3), 356–365. Smoller, J. W. (2014). Psychiatric genetics and the future of personalized treatment. Depression and Anxiety, 31(11), 893–898. https://doi.org/10.1002/ da.22322. Speed, D., Hemani, G., Johnson, M. R., & Balding, D. J. (2012). Improved heritability estimation from genome-wide SNPs. American Journal of Human Genetics, 91(6), 1011–1021. https://doi.org/10.1016/j.ajhg.2012.10.010. Spielman, R. S., McGinnis, R. E., & Ewens, W. J. (1993). Transmission test for linkage disequilibrium: The insulin gene region and insulin-dependent diabetes mellitus (IDDM). American Journal of Human Genetics, 52(3), 506–516. Strobl, C., Malley, J., & Tutz, G. (2009). An introduction to recursive partitioning: Rationale, application, and characteristics of classification and regression trees, bagging, and random forests. Psychological Methods, 14(4), 323–348. https://doi.org/10.1037/a0016973. Sullivan, P. F. (2010). The psychiatric GWAS consortium: Big science comes to psychiatry. Neuron, 68(2), 182–186. https://doi.org/10.1016/j. 
neuron.2010.10.003. Szklarczyk, D., Franceschini, A., Wyder, S., Forslund, K., Heller, D., Huerta-Cepas, J., … von Mering, C. (2015). STRING v10: Protein-protein interaction networks, integrated over the tree of life. Nucleic Acids Research, 43(Database issue), D447–D452. https://doi.org/10.1093/nar/gku1003. Tamiya, G., Shinya, M., Imanishi, T., Ikuta, T., Makino, S., Okamoto, K., … Inoko, H. (2005). Whole genome association study of rheumatoid arthritis using 27 039 microsatellites. Human Molecular Genetics, 14(16), 2305–2321. https://doi.org/10.1093/hmg/ddi234. Thomas, D. (2010). Methods for investigating gene-environment interactions in candidate pathway and genome-wide association studies. Annual Review of Public Health, 31, 21–36. https://doi.org/10.1146/annurev.publhealth.012809.103619. Tian, C., Plenge, R. M., Ransom, M., Lee, A., Villoslada, P., Selmi, C., … Seldin, M. F. (2008). Analysis and application of European genetic substructure using 300 K SNP information. PLoS Genetics, 4(1). https://doi.org/10.1371/journal.pgen.0040004. Tomita, Y., Tomida, S., Hasegawa, Y., Suzuki, Y., Shirakawa, T., Kobayashi, T., & Honda, H. (2004). Artificial neural network approach for selection of susceptible single nucleotide polymorphisms and construction of prediction model on childhood allergic asthma. BMC Bioinformatics, 5, 120. https:// doi.org/10.1186/1471-2105-5-120. Torres-Garcia, W., Zheng, S., Sivachenko, A., Vegesna, R., Wang, Q., Yao, R., … Verhaak, R. G. (2014). PRADA: Pipeline for RNA sequencing data analysis. Bioinformatics, 30(15), 2224–2226. https://doi.org/10.1093/bioinformatics/btu169. Trapnell, C., Williams, B. A., Pertea, G., Mortazavi, A., Kwan, G., van Baren, M. J., … Pachter, L. (2010). Transcript assembly and quantification by RNASeq reveals unannotated transcripts and isoform switching during cell differentiation. Nature Biotechnology, 28(5), 511–515. https://doi.org/10.1038/ nbt.1621. 
Uhr, M., Tontsch, A., Namendorf, C., Ripke, S., Lucae, S., Ising, M., … Holsboer, F. (2008). Polymorphisms in the drug transporter gene ABCB1 predict antidepressant treatment response in depression. Neuron, 57(2), 203–209. https://doi.org/10.1016/j.neuron.2007.11.017.
Varet, H., Brillet-Gueguen, L., Coppee, J. Y., & Dillies, M. A. (2016). SARTools: A DESeq2- and EdgeR-based R pipeline for comprehensive differential analysis of RNA-Seq data. PLoS One, 11(6). https://doi.org/10.1371/journal.pone.0157022.
Visscher, P. M., Brown, M. A., McCarthy, M. I., & Yang, J. (2012). Five years of GWAS discovery. American Journal of Human Genetics, 90(1), 7–24. https://doi.org/10.1016/j.ajhg.2011.11.029.
Wakefield, J., Haneuse, S., Dobra, A., & Teeple, E. (2011). Bayes computation for ecological inference. Statistics in Medicine, 30(12), 1381–1396. https://doi.org/10.1002/sim.4214.
Wei, J., & Hemmings, G. P. (2000). The NOTCH4 locus is associated with susceptibility to schizophrenia. Nature Genetics, 25(4), 376–377. https://doi.org/10.1038/78044.
Weinberg, C. R., & Umbach, D. M. (2000). Choosing a retrospective design to assess joint genetic and environmental contributions to risk. American Journal of Epidemiology, 152(3), 197–203.
Wellcome Trust Case Control Consortium. (2007). Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature, 447(7145), 661–678. https://doi.org/10.1038/nature05911.
Wray, N. R., Lee, S. H., Mehta, D., Vinkhuyzen, A. A., Dudbridge, F., & Middeldorp, C. M. (2014). Research review: Polygenic methods and their application to psychiatric traits. Journal of Child Psychology and Psychiatry, 55(10), 1068–1087. https://doi.org/10.1111/jcpp.12295.
Wray, N. R., Sullivan, P. F., & Major Depressive Disorder Working Group of the PGC. (2017). Genome-wide association analyses identify 44 risk variants and refine the genetic architecture of major depressive disorder. bioRxiv. https://doi.org/10.1101/167577.
Wreczycka, K., Gosdschan, A., Yusuf, D., Gruning, B., Assenov, Y., & Akalin, A. (2017). Strategies for analyzing bisulfite sequencing data. Journal of Biotechnology, 261, 105–115. https://doi.org/10.1016/j.jbiotec.2017.08.007.
Wright, M. L., Dozmorov, M. G., Wolen, A. R., Jackson-Cook, C., Starkweather, A. R., Lyon, D. E., & York, T. P. (2016). Establishing an analytic pipeline for genome-wide DNA methylation. Clinical Epigenetics, 8, 45. https://doi.org/10.1186/s13148-016-0212-7.
Yang, J., Benyamin, B., McEvoy, B. P., Gordon, S., Henders, A. K., Nyholt, D. R., … Visscher, P. M. (2010). Common SNPs explain a large proportion of the heritability for human height. Nature Genetics, 42(7), 565–569. https://doi.org/10.1038/ng.608.
Yang, J., Lee, S. H., Goddard, M. E., & Visscher, P. M. (2011). GCTA: A tool for genome-wide complex trait analysis. American Journal of Human Genetics, 88(1), 76–82. https://doi.org/10.1016/j.ajhg.2010.11.011.
Yu, K., Wacholder, S., Wheeler, W., Wang, Z., Caporaso, N., Landi, M. T., & Liang, F. (2012). A flexible Bayesian model for studying gene-environment interaction. PLoS Genetics, 8(1). https://doi.org/10.1371/journal.pgen.1002482.
Zhang, Y., & Liu, J. S. (2007). Bayesian inference of epistatic interactions in case-control studies. Nature Genetics, 39(9), 1167–1173. https://doi.org/10.1038/ng2110.
Zheng, J., Erzurumluoglu, A. M., Elsworth, B. L., Kemp, J. P., Howe, L., Haycock, P. C., … Neale, B. M. (2017). LD Hub: A centralized database and web interface to perform LD score regression that maximizes the potential of summary level GWAS data for SNP heritability and genetic correlation analysis. Bioinformatics, 33(2), 272–279. https://doi.org/10.1093/bioinformatics/btw613.
Further reading
Keverne, J., Czamara, D., Cubells, J. F., & Binder, E. B. (2016). Genetics and genomics. In A. Schatzberg & C. B. Nemeroff (Eds.), Textbook of psychopharmacology (5th ed.). The American Psychiatric Association Publishing.