Neuroscience 145 (2007) 1273–1279
COMPUTATIONAL PREDICTION OF THE EFFECTS OF NON-SYNONYMOUS SINGLE NUCLEOTIDE POLYMORPHISMS IN HUMAN DNA REPAIR GENES
S. NAKKEN,a I. ALSETHa AND T. ROGNESa,b*
The most common form of genetic variation in the human population occurs as single nucleotide polymorphisms (SNPs). It has been estimated that there exists one SNP with a minor allele frequency greater than 1% every 290 base-pairs in the human genome, implying a total of about 10 million SNPs (Kruglyak and Nickerson, 2001). The large number of SNPs, combined with a growing functional annotation of the human genome sequence, provides ample opportunity for developing improved links between genetic and phenotypic variation. Detection of genetic variants that contribute susceptibility to a complex human disease is usually undertaken as an association analysis. In such studies, the allele frequencies of a set of polymorphisms are compared between affected cases and healthy controls. In this manner, one can identify markers that differ significantly between the two groups. A number of studies have shown association between one or a few polymorphisms and complex diseases, but most of them have been hard to replicate (Lohmueller et al., 2003). Inconsistent results may have many explanations. Often-cited reasons are improper study design, insufficient sample size and complexity of traits (Au and Salama, 2005; Newton-Cheh and Hirschhorn, 2005). The design and statistical analysis of genome-wide association studies is still a field in its infancy. The simplest and most commonly used strategy limits the polymorphic markers to the coding regions of candidate genes that are known or hypothesized to be associated with the trait of interest. By adopting such a direct approach, one is targeting a smaller number of polymorphisms that are themselves believed to be putative causal alleles. In order to increase the success rate of direct association studies, it is therefore critical to prioritize markers that have a high probability of being functional. 
The most powerful way of assessing the effects of polymorphisms in coding regions is by focusing on the fraction that alter the encoded amino acid sequence (nonsynonymous SNP (nsSNP) or missense change). These substitutions may directly affect the protein structure stability and efficiency of protein interactions. The biochemical severity (e.g. differences in side-chain polarity) of the substitution and the degree of evolutionary conservation at the variant site are examples of properties that indicate the degree of functional significance of an amino acid alteration. These features may act as predictors for the anticipated phenotypic effects of missense changes. A comprehensive analysis of properties of nsSNPs can as such make important contributions in the exploration of hypotheses concerning the plausible biological mechanisms that
a Centre for Molecular Biology and Neuroscience, Institute of Medical Microbiology, Rikshospitalet-Radiumhospitalet Medical Centre, NO-0027 Oslo, Norway
b Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway
Abstract: Non-synonymous single nucleotide polymorphisms (nsSNPs) represent common genetic variation that alters encoded amino acids in proteins. All nsSNPs may potentially affect the structure or function of expressed proteins and could therefore have an impact on complex diseases. In an effort to evaluate the phenotypic effect of all known nsSNPs in human DNA repair genes, we have characterized each polymorphism in terms of different functional properties. The properties are computed based on amino acid characteristics (e.g. residue volume change), position-specific phylogenetic information from multiple sequence alignments and from prediction programs such as SIFT (Sorting Intolerant From Tolerant) and PolyPhen (Polymorphism Phenotyping). We provide a comprehensive, updated list of all validated nsSNPs from dbSNP (public database of human single nucleotide polymorphisms at National Center for Biotechnology Information, USA) located in human DNA repair genes. The list includes repair enzymes, genes associated with response to DNA damage as well as genes implicated in genetic instability or sensitivity to DNA damaging agents. Out of a total of 152 genes involved in DNA repair, 95 had validated nsSNPs in them. The fraction of nsSNPs that had high probability of being functionally significant was predicted to be 29.6% and 30.9%, by SIFT and PolyPhen respectively. The resulting list of annotated nsSNPs is available online (http://dna.uio.no/repairSNP), and is an ongoing project that will continue assessing the function of coding SNPs in human DNA repair genes. © 2006 IBRO. Published by Elsevier Ltd. All rights reserved.
Key words: SNP, non-synonymous, dbSNP, phenotypic effects.
*Correspondence to: T. Rognes, Centre for Molecular Biology and Neuroscience, Institute of Medical Microbiology, Rikshospitalet-Radiumhospitalet Medical Centre, NO-0027 Oslo, Norway. Tel: +47-22844787; fax: +47-22844782. E-mail address: [email protected] (T. Rognes).
Abbreviations: BER, base excision repair; BLAST, Basic Local Alignment Search Tool; cSNP, coding single nucleotide polymorphism (SNP in protein-coding region); dbSNP, public database of human single nucleotide polymorphisms at National Center for Biotechnology Information, USA; NCBI, National Center for Biotechnology Information; nsSNP, non-synonymous single nucleotide polymorphism; PDB, Protein Data Bank; PolyPhen, Polymorphism Phenotyping; PSSM, position-specific scoring matrix; SIFT, Sorting Intolerant From Tolerant; SNP, single nucleotide polymorphism.
0306-4522/07$30.00+0.00 © 2006 IBRO. Published by Elsevier Ltd. All rights reserved. doi:10.1016/j.neuroscience.2006.09.004
S. Nakken et al. / Neuroscience 145 (2007) 1273–1279
may explain the association of a gene with a specific trait. Indeed, in their in-depth analysis of XPD (ERCC2) polymorphisms, Clarkson and Wood (2005) show that a closer investigation of the polymorphisms in question, both in silico and experimentally, is needed before phenotype/genotype association studies are performed.
DNA repair genes play a critical role in the maintenance of genome integrity. Variation in these genes may modulate the repair capacity, which in effect may lead to elevated risk of complex disease (e.g. various types of cancer) (Berwick and Vineis, 2000; Spitz et al., 2003; Hung et al., 2005). An overview and functional analysis of validated nsSNPs in the coding regions of 152 human DNA repair genes is presented here. Although restricted to repair genes in our analysis, the general strategy is applicable to any functional set of genes. The aim of the study is to assess the isolated phenotypic effects of all coding-region nsSNPs in repair genes, and as such contribute to the process of selecting target SNPs for direct association studies involving repair genes. Our selection of repair genes is based on the list provided by Wood and colleagues (Wood et al., 2001, 2005), accessible at http://www.cgal.icnet.uk/DNA_Repair_Genes.html. Two other genes of relevance to DNA repair are also included: TFAM (mitochondrial transcription factor A), responsible for maintaining mitochondrial DNA (Kang and Hamasaki, 2005), and SIRT3, a mitochondrial NAD-dependent deacetylase which may have a role in human longevity (Rose et al., 2003).
First, we outline a brief procedure for obtaining reliable SNP data. Next, we review the various computational approaches for in silico predictions of nsSNPs. Finally, we report the results of applying a set of these predictors to nsSNPs in repair genes.
Previously, similar studies have been reported (Savas et al., 2004; Xi et al., 2004; Zhu et al., 2004; Rudd et al., 2005), with different types of data material and different subsets of repair genes. We provide an updated, comprehensive and publicly available resource on known polymorphisms within this functional category of genes.
SNP MINING
The largest repository of SNP data is located within the National Center for Biotechnology Information (NCBI, Bethesda, MD, USA) dbSNP database (public database of human single nucleotide polymorphisms at National Center for Biotechnology Information, USA; http://www.ncbi.nlm.nih.gov/SNP). The bidirectional data exchange between dbSNP and other large SNP efforts such as HGVbase (Fredman et al., 2002) and TSC (Holden, 2002) has ensured its position as the main public resource for SNP mining. However, an unknown fraction of the submissions to dbSNP may not be true polymorphisms, but rather examples of sequencing errors or paralogous sequence variants (Fredman et al., 2004). Large-scale verification studies of putative polymorphic loci in dbSNP have reported a monomorphic rate of 17–48% (different rates are likely to stem from different technologies and protocols) (Carlson et al., 2003; Reich et al., 2003; Nelson et al., 2004). For the
purpose of selecting proper SNPs for an association study it is therefore essential to exclude entries that have a high probability of being artifacts or monomorphic in the relevant population. First, the genomic location of a SNP should be verified by a BLAST (Basic Local Alignment Search Tool; Altschul et al., 1997) search with the flanking sequence of the SNP as the query (Savas et al., 2004). SNPs whose neighboring sequences match multiple regions in the genome are then filtered out as unreliable. Furthermore, a reliability measure is provided by NCBI in the form of a record validation status; a polymorphism is validated by either independent submissions, frequency/genotype data, observed alleles in at least two chromosomes or submitted allele frequencies estimated by the HapMap project (International HapMap Consortium, 2003). The use of validated entries, preferably with allele frequency information, has been shown to increase the genotype success rate significantly (Nelson et al., 2004; Edvardsen et al., 2006), indicating that these entries are more likely to be true polymorphisms. As of April 2006, dbSNP contains approximately 4.9 million validated biallelic positions in the human genome. In our analysis of nsSNPs in repair genes, we have excluded non-validated polymorphisms and entries that map to multiple locations in the human genome. Since some repair genes encode several protein isoforms, a polymorphism is included several times if it occurs within more than one RefSeq (http://www.ncbi.nlm.nih.gov/RefSeq) gene product. Whenever available, we annotate the polymorphisms with estimated population-specific allele frequencies from two different sources, the HapMap project and Perlegen (http://genome.perlegen.com) (Hinds et al., 2005).
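The two exclusion criteria described above (validation status and unique genomic mapping) can be illustrated with a minimal Python sketch. The record layout, the field names `validated` and `genome_hits`, and the example identifiers are hypothetical and chosen only for illustration; they do not reflect the dbSNP data format or the authors' actual pipeline.

```python
from dataclasses import dataclass


@dataclass
class SnpRecord:
    """Hypothetical, simplified view of one SNP entry."""
    rs_id: str
    validated: bool   # assumed flag: entry carries a validation status
    genome_hits: int  # assumed count of genomic regions matched by the flanking sequence


def filter_reliable(records):
    """Keep validated SNPs whose flanking sequence maps to exactly one locus."""
    return [r for r in records if r.validated and r.genome_hits == 1]


# Made-up example entries (not real dbSNP identifiers).
records = [
    SnpRecord("rs_example_1", validated=True, genome_hits=1),
    SnpRecord("rs_example_2", validated=False, genome_hits=1),  # not validated
    SnpRecord("rs_example_3", validated=True, genome_hits=3),   # maps to several loci
]
kept = [r.rs_id for r in filter_reliable(records)]
```

In practice the number of genomic hits would come from a BLAST search with the flanking sequence, which is outside the scope of this sketch.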
PREDICTION OF FUNCTIONAL MISSENSE CHANGES
Allelic variants that alter the amino acid sequence of a gene product may affect the cellular phenotype at various levels. They may directly influence the stability of the native protein structure and the folding rate, resulting in a reduced concentration of the protein (Karchin et al., 2005). Polymorphisms residing in ligand-binding and catalytic sites may further affect protein interactions and other biochemical activities inside the cell (Sunyaev et al., 2001). Effects at the level of transcription, translation as well as post-translational modification can also occur, but these are relatively poorly characterized (Wang and Moult, 2001). Following the steadily increasing number of known human nsSNPs, there has been growing interest in the identification of the subset that may affect the cellular phenotype. The ultimate goal is a separation between neutral, non-functional nsSNPs and the ones that are functional, providing a damaging potential to the encoded protein. The proposed approaches for this problem use various types of features for prediction, mainly physical and chemical properties of the amino acids, structural properties of the encoded protein and evolutionary properties derived from sequence alignments of homologous proteins
(Mooney, 2005). It has been shown that these properties differ significantly between rare Mendelian disease-causing mutations and seemingly neutral mutations (Saunders and Baker, 2002). Based on this observation, a number of computational classification models have been developed for the purpose of discriminating between tolerant and intolerant SNPs in the human genome (Saunders and Baker, 2002; Herrgard et al., 2003; Krishnan and Westhead, 2003; Cai et al., 2004; Ferrer-Costa et al., 2004; Bao and Cui, 2005; Yue and Moult, 2006). However, for the case of complex human traits, in which SNPs are believed to play one of their main roles, caution should be exercised when developing discriminative models using monogenic disease mutations as "true positives." Using a fairly small dataset, a recent study showed how sites of coding single nucleotide polymorphisms (cSNPs) presumably associated with complex disease displayed a lesser degree of evolutionary sequence conservation than mutations involved in Mendelian disorders (Thomas and Kejariwal, 2004). While this observation may potentially limit the applicability of an in silico analysis of SNPs as outlined here, the usefulness has been exemplified in other reports, showing benchmarks of how this type of analysis may correctly predict known risk variants as damaging (Savas et al., 2004; Zhu et al., 2004). In the following subsections, we review the various types of properties that are commonly examined in the analysis and identification of putative functional nsSNPs. We also outline the computations we have performed in our analysis.
Amino acid properties
The physicochemical properties of the 20 amino acids play a significant role in protein folding and stability. Basic characteristics of the side chains such as molecular mass, polarity, acidity, basicity, aromaticity, conformational flexibility and ability to hydrogen bond are responsible for a great range of protein structure properties (Voet, 1995).
Thus, the compatibility of a substitution can to some extent be evaluated based on basic features of the amino acids. The classic Grantham matrix was developed on these premises, providing a quantitative measure of chemical distance between the different amino acids (Grantham, 1974). Recently, an incorporation of sequence variation in a multiple sequence alignment extended the general Grantham distance to a more evolutionary, protein-specific Grantham score (Tavtigian et al., 2006). The Align-GVGD (Align-Grantham Variation Grantham Deviation) score reflects how the chemical characteristics of the variant amino acid fit within the range of variation tolerated at the substitution site. The method was developed for the purpose of scoring missense substitutions in the BRCA1 gene only, and needs to be evaluated further on a larger scale. Furthermore, Kyte and Doolittle (1982) have composed a hydropathy scale with which the hydrophilicity and hydrophobicity along a protein chain can be determined, properties that are essential for the protein folding process. The magnitude of change in hydrophobicity between two variants gives an estimate of how well the hydrophobic
nature of a residue is conserved (Balasubramanian et al., 2005). In addition to Grantham scores and changes in hydrophobicity, we annotated the nsSNPs in our dataset with absolute changes in molecular mass and volume of the variant side chains (Voet, 1995). The idea here is that a large change in mass or volume of the side chains involved in an amino acid substitution will increase their damaging potential.
Protein structure properties
The first approaches for studying the phenotypic effects of cSNPs utilized properties from experimentally determined protein structures. Wang and Moult (2001) analyzed disease-causing missense cSNPs in 23 proteins from the Human Gene Mutation Database (HGMD), and devised a number of rules based upon the protein structure stability that could capture effects of the SNP on molecular function. Examples of the rules are loss of hydrogen bonds, introduction of a buried polar residue, loss of salt bridge, insertion of proline into α-helix, and breakage of disulfide bond. Similar rules were employed by others (Chasman and Adams, 2001; Sunyaev et al., 2001). Chasman et al. also emphasized the solvent accessibility (Bowie et al., 1990) and the B-factor (Matthews, 1995) of the polymorphic model residue as important discriminative factors in their structure-based assessment of nsSNPs (Chasman and Adams, 2001). Stitziel et al. (2003) used computational geometry analysis of structural sites of disease-associated nsSNPs, and showed that the majority of them are located in pockets or voids of the protein. The current number of known protein structures is unfortunately far less than the number of known human protein sequences. For human DNA repair protein products, a crude search (sequence homology search with ParAlign (Rognes, 2001), E-value cutoff at 10.e-15) in the Protein Data Bank (PDB) revealed that approximately 36% (55 out of a total of 152 repair genes) had no significant matches with experimentally determined peptide structures.
Although homology models may infer structural information for sequences without known structure, we have not included an in-depth structural analysis of nsSNPs in repair genes. The algorithm used by PolyPhen (polymorphism phenotyping; see below), however, uses structural data from PDB whenever these are available, and these predictions are included in our analysis. Despite the discrepancy between available structure and sequence data, the importance of structural analysis of nsSNPs is undisputed, especially when the number of homologs is small and the evolutionary analysis becomes less reliable (see below) (Saunders and Baker, 2002).
Evolutionary properties
Highly conserved residues in a protein family are generally expected to be important for the function of the protein. An evolutionary approach to SNP screening can thus be applied by the extraction of conservation scores from a multiple sequence alignment of homologous proteins. Two commonly used tools are based on this approach. SIFT (Sorting Intolerant From Tolerant, http://blocks.fhcrc.org/sift/SIFT.html)
predicts whether an amino acid substitution may have impact on protein function by a specialized alignment of orthologous and/or paralogous protein sequences; calculating a score representing the likelihood of mutability at the site of the substitution (Ng and Henikoff, 2001; Ng and Henikoff, 2003). PolyPhen (http://www.bork.embl-heidelberg.de/polyphen) combines a conservation score with additional properties (physicochemical differences and structural features of the polymorphic variants) in order to predict the functional importance of an amino acid alteration (Sunyaev et al., 2000, 2001; Ramensky et al., 2002). A simple measure of residue diversity at a position j in a multiple sequence alignment can also be computed in terms of a site entropy score, $S_j$, calculated by normalizing Shannon's original formula (Shannon, 1948; Valdar, 2002):

$$S_j = -\frac{1}{\ln K} \sum_{i=1}^{K} p_i \ln(p_i)$$

Here $K$ is the number of amino acids and $p_i$ is the probability of observing amino acid i at position j. Conserved sites in a sequence alignment have site entropy scores close to zero. A large gap fraction in alignment columns will pose problems to the site entropy calculation, since a gap is merely treated as the 21st amino acid. A minimum requirement for a reliable entropy score is a gap fraction less than 0.5. The position-specific scoring matrix (PSSM) produced by a position-specific iterated BLAST (PSI-BLAST) (Altschul et al., 1997) search provides yet another option for determining the likelihood for a substitution to occur at a specific site. The probability of substituting a variant residue of type a at position j in the sequence alignment, P(a,j), is taken directly from the corresponding matrix element in the PSSM (Yue and Moult, 2006). Due to limited amounts of available protein structure data, multiple sequence alignments and measures of residue conservation have become the most utilized predictor types for the functional importance of an amino acid substitution. Generally, a prediction will be most reliable when a large number of homologous sequences are found. Also, it is important that the public sequences used in the alignment be correctly assembled and trustworthy. In some instances, automatically generated sequence alignments need manual curation in order to remove errors and produce biologically meaningful alignments. Since the various evolutionary predictors are all extracted from automatically generated alignments, misleading predictions may occur. We have used SIFT scores, PolyPhen scores, site entropy, average site entropy and the PSSM-score as our evolutionary features. For the computation of entropy scores, we searched for homologs with ParAlign (Rognes, 2001), and made multiple sequence alignments with the MUSCLE program (Edgar, 2004). Table 1 summarizes all of the predictors used in our analysis.
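As an illustration of the site entropy score and the gap-fraction requirement described above, the following Python sketch computes a normalized Shannon entropy for a single alignment column. It is a sketch under stated assumptions rather than the authors' implementation: the column is assumed to be a plain string of one-letter residue codes, probabilities are estimated as observed frequencies, the gap character "-" is counted as a 21st symbol, and the normalization constant K defaults to 21 for that reason.

```python
import math
from collections import Counter


def site_entropy(column, alphabet_size=21):
    """Normalized Shannon entropy of one alignment column.

    Assumption: residue probabilities are the observed frequencies in the
    column, and a gap ("-") is counted as one more symbol, hence K = 21.
    """
    n = len(column)
    counts = Counter(column)
    shannon = -sum((c / n) * math.log(c / n) for c in counts.values())
    return shannon / math.log(alphabet_size)


def gap_fraction(column, gap="-"):
    """Fraction of gap characters in the column."""
    return column.count(gap) / len(column)


def reliable_entropy(column, max_gap_fraction=0.5):
    """Return the entropy only when the gap fraction is below the cutoff."""
    if gap_fraction(column) >= max_gap_fraction:
        return None
    return site_entropy(column)
```

A fully conserved column yields a score of zero and a column in which all 21 symbols are equally frequent yields a score of one, which matches the interpretation that conserved sites have entropy scores close to zero.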
Table 1. Description of the various numerical predictors used for identification of functional, potentially damaging nsSNPs
Mass: Absolute difference in side-chain mass (Da) between variant residues (Voet, 1995)
Volume: Absolute difference in side-chain volume (cubic Å) between variant residues (Voet, 1995)
Hydrophobicity: Change in hydrophobicity between variant residues as calculated by the Kyte-Doolittle scale (Kyte and Doolittle, 1982)
Grantham: Chemical distance between amino acids (Grantham, 1974)
Site entropy: Shannon's site entropy as a measure of residue conservation (Shannon, 1948)
Average site entropy: Average site entropy in the protein sequence
SIFT: Conservation score from MSA reflecting damaging potential of SNP (Ng and Henikoff, 2001; Ng and Henikoff, 2003)
PSSM-score: Likelihood of substituting amino acid j at position i in a sequence, taken from element i,j in the scoring matrix outputted by PSI-BLAST (Altschul et al., 1997; Yue and Moult, 2006)
PolyPhen: Combines structural, evolutionary and physicochemical properties in a substitution tolerance score (Sunyaev et al., 2001)

COMPUTATIONAL PREDICTIONS OF nsSNPs IN DNA REPAIR GENES
Careful filtering of dbSNP entries as outlined above resulted in 677 validated transcript-specific nsSNPs located in 152 genes associated with DNA repair. The data with selected predictions for genes involved in the base excision repair (BER) pathway are listed in Table 2; the rest is available online at http://dna.uio.no/repairSNP. For 57 genes, dbSNP contained no validated nsSNPs; 51.8% of the nsSNPs were annotated with estimated allele frequencies from either HapMap or Perlegen, implying that these polymorphisms are not only validated according to the dbSNP criteria, but also have been proven to occur in a specific population in the world. An investigation of the nsSNPs at the nucleotide level revealed a fraction of transversions (shifts between pyrimidines and purines) of approximately 28.3%. SIFT and PolyPhen predicted the fraction of functional nsSNPs in repair genes to be 29.6% and 30.9%, respectively. Functional nsSNPs are here defined as nsSNPs predicted "intolerant" by SIFT and "probably damaging" or "possibly damaging" (score >1.500) by PolyPhen. Furthermore, for each predictor type we ranked the substitutions according to their damaging potential, and then computed the Spearman's rank correlation coefficient between the rankings. The correlations were computed for the subset of nsSNPs that had reliable SIFT scores (SIFT reliability index >0.5) and entropy scores based on alignment columns with a gap fraction of less than 0.5, resulting in n=192 nsSNPs occurring in 79 different repair genes. Table 3 shows pairs of predictors that showed significant correlations (obvious significant correlation coefficients between similar types of predictors, such as volume and mass, are omitted). As our allele frequency data have been fetched from two sources that differ in size and composition, HapMap and Perlegen, we have chosen to omit an investigation of a potential correlation between rare/common SNPs and their functional significance.
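The nucleotide-level classification mentioned above (transitions versus transversions) can be sketched in a few lines of Python. The input format, strings such as "C>G" giving the reference and variant base, is an assumption made for this example, and the four substitutions used are illustrative rather than a reproduction of the study's dataset.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """True for purine<->purine or pyrimidine<->pyrimidine changes."""
    pair = {ref, alt}
    return ref != alt and (pair <= PURINES or pair <= PYRIMIDINES)


def transversion_fraction(substitutions):
    """Fraction of substitutions that swap a purine and a pyrimidine.

    Assumes each item looks like "C>G" with two different bases.
    """
    changes = [s.split(">") for s in substitutions]
    transversions = sum(1 for ref, alt in changes if not is_transition(ref, alt))
    return transversions / len(changes)


# Illustrative input: two transversions (C>G, G>T) and two transitions (G>A, C>T).
example_fraction = transversion_fraction(["C>G", "G>A", "C>T", "G>T"])
```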
Table 2. Validated nsSNPs in genes in the BER pathway
dbSNP ID    Protein    Gene   Nucleotide  Amino acid  Pos  SIFT                    PolyPhen
rs7183491   NP_078884  NEIL1  C>G         I>M         182  Intolerant              Borderline
rs5745926   NP_078884  NEIL1  G>A         D>N         252  Tolerant                Potentially damaging
rs8191612   NP_659480  NEIL2  C>T         R>W         103  Intolerant              Possibly damaging
rs8191613   NP_659480  NEIL2  G>A         R>Q         103  Tolerant                Benign
rs8191664   NP_659480  NEIL2  G>T         R>L         257  Tolerant                Possibly damaging
rs2233920   NP_055126  SMUG1  G>T         G>V         15   Tolerant                Possibly damaging
rs2307293   NP_003916  MBD4   G>C         D>H         568  Intolerant              Possibly damaging
rs2307298   NP_003916  MBD4   T>C         I>T         358  Tolerant                Borderline
rs140693    NP_003916  MBD4   G>A         E>K         346  Tolerant                Benign
rs2307289   NP_003916  MBD4   T>C         S>P         342  Borderline              Potentially damaging
rs4135113   NP_003202  TDG    G>A         G>S         199  Tolerant                Possibly damaging
rs3219496   NP_036354  MUTYH  C>A         L>M         526  Intolerant              Benign
rs3219489   NP_036354  MUTYH  G>C         Q>H         335  Borderline              Borderline
rs3219484   NP_036354  MUTYH  G>A         V>M         22   Intolerant              Benign
rs3087468   NP_002519  NTHL1  G>T         D>Y         239  Intolerant              Probably damaging
rs1805378   NP_002519  NTHL1  T>C         I>T         176  Intolerant              Probably damaging
rs2302172   NP_002519  NTHL1  G>A         R>K         33   Borderline              n/a
rs3087469   NP_002519  NTHL1  C>T         R>W         21   Intolerant              Probably damaging
rs17050550  NP_002533  OGG1   G>T         A>S         85   Tolerant                Benign
rs1805373   NP_002533  OGG1   G>A         R>Q         229  Intolerant              Possibly damaging
rs3219012   NP_002533  OGG1   C>T         A>V         288  Tolerant                Borderline
rs1052133   NP_002533  OGG1   C>G         S>C         326  Borderline              Benign
rs1801128   NP_002533  OGG1   G>C         S>T         320  Tolerant                Potentially damaging
rs2266607   NP_002425  MPG    T>C         Y>H         71   Tolerant                Potentially damaging
rs25671     NP_002425  MPG    A>G         Q>R         93   Tolerant                Potentially damaging
rs2308313   NP_002425  MPG    C>T         R>C         120  Potentially intolerant  Probably damaging
rs2308312   NP_002425  MPG    G>A         R>Q         141  Intolerant              Potentially damaging
rs769193    NP_002425  MPG    C>T         A>V         258  Tolerant                Benign
rs2234949   NP_002425  MPG    G>T         A>S         298  n/a                     Benign
rs4986973   NP_039269  LIG3   C>T         P>S         899  Borderline              n/a
rs1048945   NP_001632  APEX1  G>C         Q>H         51   Borderline              Benign
rs2307486   NP_001632  APEX1  A>G         I>V         64   Potentially intolerant  Benign
rs3136820   NP_001632  APEX1  T>G         D>E         148  Tolerant                Benign
rs1803118   NP_001632  APEX1  C>T         A>V         317  Potentially intolerant  Borderline
rs2301416   NP_055296  APEX2  C>T         R>C         141  Borderline              Possibly damaging
rs25474     NP_006288  XRCC1  C>T         P>L         514  Borderline              Probably damaging
rs25487     NP_006288  XRCC1  G>A         R>Q         399  Intolerant              Benign
rs25491     NP_006288  XRCC1  C>T         P>S         309  Tolerant                Benign
rs25490     NP_006288  XRCC1  A>G         T>A         304  Potentially intolerant  Borderline
rs2307188   NP_006288  XRCC1  A>C         K>N         298  Tolerant                Possibly damaging
rs25489     NP_006288  XRCC1  G>A         R>H         280  Intolerant              Possibly damaging
rs1799782   NP_006288  XRCC1  C>T         R>W         194  Borderline              Probably damaging
rs2307191   NP_006288  XRCC1  C>T         P>L         161  Intolerant              Probably damaging
rs25496     NP_006288  XRCC1  T>C         V>A         72   Intolerant              Potentially damaging
rs2307186   NP_006288  XRCC1  G>T         R>L         7    Intolerant              Possibly damaging
rs3739186   NP_009185  PNKP   T>A         Y>N         196  Tolerant                Probably damaging
rs3739185   NP_009185  PNKP   C>A         R>S         180  Borderline              Potentially damaging
rs3739168   NP_009185  PNKP   C>T         P>S         20   Intolerant              Possibly damaging
rs3219145   NP_001609  PARP1  G>A         R>K         940  Tolerant                Possibly damaging
rs1136410   NP_001609  PARP1  T>C         V>A         762  Tolerant                Probably damaging
rs2230484   NP_001609  PARP1  C>T         P>S         377  Tolerant                Potentially damaging
rs3219057   NP_001609  PARP1  G>A         V>I         334  Tolerant                Benign
rs1805409   NP_001609  PARP1  G>A         A>T         188  Tolerant                Borderline
rs3093905   NP_005475  PARP2  G>A         S>N         112  Tolerant                Benign
rs3093906   NP_005475  PARP2  A>G         N>S         119  Tolerant                Benign
rs3093921   NP_005475  PARP2  A>G         D>G         186  Potentially intolerant  Possibly damaging
Predictions of phenotypic effect from SIFT and PolyPhen are shown for each nsSNP. For genes with multiple transcripts (OGG1 and APEX1), only SNPs from the main transcripts are listed. The designations used for SIFT and PolyPhen scores (tolerant, intolerant, probably damaging etc.) have been used according to previously proposed classifications (Ng and Henikoff, 2003; Xi et al., 2004). n/a, data not available.
Table 3. Significant correlation coefficients between functional predictors for n=192 nsSNPs in 79 DNA repair genes
Predictor pair       Spearman's rank correlation coefficient (significance)
Mass/Grantham        0.43 (P=7.83e-10)
Mass/SIFT            -0.18 (P=0.0123)
Mass/PolyPhen        0.29 (P=4.249e-5)
Mass/PSSM-score      -0.15 (P=0.0365)
Volume/Grantham      0.63 (P<2.2e-16)
Volume/PolyPhen      0.23 (P=0.0017)
Grantham/SIFT        -0.145 (P=0.044)
Grantham/PolyPhen    0.36 (P=3.51e-7)
Grantham/PSSM-score  -0.267 (P=0.00019)
Site entropy/SIFT    0.306 (P=1.763e-5)
PSSM-score/SIFT      0.495 (P<2.2e-16)
PolyPhen/SIFT        -0.52 (P<2.2e-16)
PolyPhen/PSSM        -0.39 (P=2.58e-8)
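For readers who want to see how a Spearman's rank correlation coefficient of the kind reported in Table 3 is obtained, the following self-contained Python sketch ranks two score lists (using average ranks for ties) and computes the Pearson correlation of the ranks. It is a generic illustration, not the authors' code; the function names are chosen for this example, and significance testing (the P values in the table) is not covered.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the group of equal values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the two rank vectors.

    Assumes both inputs have the same length and are not constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

Two predictors that order the substitutions identically give a coefficient of 1, and predictors that order them in exactly opposite ways give -1; the sign therefore depends on whether high or low values of each score indicate a damaging substitution.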
DISCUSSION
Numerous association studies are currently undertaken for the purpose of explaining how common genetic variation in the form of SNPs may influence risk of complex disease in humans. Due to the large number of SNP entries in public SNP databases, a key challenge in these studies is the selection of reliable SNPs that have a high probability of affecting the cellular phenotype. Within the context of DNA repair, we have described and applied numerous approaches for in silico prediction of functional nsSNPs. Our results can provide valuable insights and helpful guidance in the choice of nsSNPs for a direct association study involving repair genes. We found that a large fraction (71.7%) of the nsSNPs in repair genes were transitions (A <-> G or C <-> T). Generally, transversions have a more severe effect on the chemical structure of the DNA. Moreover, SIFT and PolyPhen predicted that roughly 30% of the nsSNPs in repair genes have functional significance at the protein level. These numbers are in good agreement with previous reports that employed the same algorithms for prediction (Savas et al., 2004; Rudd et al., 2005). The validity of these algorithms has mainly been based on benchmarking studies involving Mendelian disease mutations, where they have been shown to correctly predict in excess of 80% of known deleterious mutations (Sunyaev et al., 2000; Ng and Henikoff, 2002). However, our predictions have been done for SNPs, not rare mutations, which may display different characteristics with respect to their pathogenic potential (Thomas and Kejariwal, 2004). At present, the amount of data on known polymorphisms associated with complex disease is too limited for a thorough evaluation of the accuracy of the various prediction algorithms. A minor attempt was made by Savas et al. (2004), who reported how SIFT correctly predicted three experimentally determined cancer risk variants in the BER pathway as potentially damaging.
The significant correlation coefficients between the various predictors we have applied (Table 3) show that there is good concordance between a number of the different properties we have examined. Change in hydrophobicity did not show any significant correlations with other predictors, and may thus be a weak predictive feature, as experienced by others (Balasubramanian et al., 2005). Some of the observed significant concordance can be explained by the fact that many of the predictors are based on similar concepts (e.g. sequence conservation). For the task of discriminating between putative functional and innocuous SNPs, it has been shown that the highest degree of accuracy is achieved when using a weighted combination of several types of predictors (Bao and Cui, 2005). A steady increase in the quantity and availability of protein structure data will give the analysis of SNPs at the protein structure level a better foundation. At present, approximate 3D structure modeling by sequence homology represents a feasible alternative when the target protein is missing an experimentally determined structure. The development of highly accurate, automatically generated homology models can therefore make important contributions in the area of SNP analysis. Furthermore, the large fraction of SNPs that occur in the non-protein-coding regions of a gene (e.g. introns, 5′ and 3′ UTR) has not been analyzed in our work. These SNPs are likely to affect the level of gene expression to various extents, but an understanding of their mechanisms of influence is still very limited. The challenge of developing in silico predictions for these types of SNPs has not yet been met, and represents an important area of future research.
REFERENCES
Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25:3389–3402.
Au WW, Salama SA (2005) Use of biomarkers to elucidate genetic susceptibility to cancer. Environ Mol Mutagen 45:222–228.
Balasubramanian S, Xia Y, Freinkman E, Gerstein M (2005) Sequence variation in G-protein-coupled receptors: analysis of single nucleotide polymorphisms. Nucleic Acids Res 33:1710–1721.
Bao L, Cui Y (2005) Prediction of the phenotypic effects of nonsynonymous single nucleotide polymorphisms using structural and evolutionary information. Bioinformatics 21:2185–2190.
Berwick M, Vineis P (2000) Markers of DNA repair and susceptibility to cancer in humans: an epidemiologic review. J Natl Cancer Inst 92:874–897.
Bowie JU, Reidhaar-Olson JF, Lim WA, Sauer RT (1990) Deciphering the message in protein sequences: tolerance to amino acid substitutions. Science 247:1306–1310.
Cai Z, Tsung EF, Marinescu VD, Ramoni MF, Riva A, Kohane IS (2004) Bayesian approach to discovering pathogenic SNPs in conserved protein domains. Hum Mutat 24:178–184.
Carlson CS, Eberle MA, Rieder MJ, Smith JD, Kruglyak L, Nickerson DA (2003) Additional SNPs and linkage-disequilibrium analyses are necessary for whole-genome association studies in humans. Nat Genet 33:518–521.
Chasman D, Adams RM (2001) Predicting the functional consequences of non-synonymous single nucleotide polymorphisms: structure-based assessment of amino acid variation. J Mol Biol 307:683–706.
Clarkson SG, Wood RD (2005) Polymorphisms in the human XPD (ERCC2) gene, DNA repair capacity and cancer susceptibility: an appraisal. DNA Repair (Amst) 4:1068–1074.
Edgar RC (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 32:1792–1797.
Edvardsen H, Irene Grenaker Alnaes G, Tsalenko A, Mulcahy T, Yuryev A, Lindersson M, Lien S, Omholt S, Syvanen AC, Borresen-Dale AL, Kristensen VN (2006) Experimental validation of data mined single nucleotide polymorphisms from several databases and consecutive dbSNP builds. Pharmacogenet Genomics 16:207–217.
Ferrer-Costa C, Orozco M, de la Cruz X (2004) Sequence-based prediction of pathological mutations. Proteins 57:811–819.
Fredman D, Siegfried M, Yuan YP, Bork P, Lehvaslaiho H, Brookes AJ (2002) HGVbase: a human sequence variation database emphasizing data quality and a broad spectrum of data sources. Nucleic Acids Res 30:387–391.
Fredman D, White SJ, Potter S, Eichler EE, Den Dunnen JT, Brookes AJ (2004) Complex SNP-related sequence variation in segmental genome duplications. Nat Genet 36:861–866.
Grantham R (1974) Amino acid difference formula to help explain protein evolution. Science 185:862–864.
Herrgard S, Cammer SA, Hoffman BT, Knutson S, Gallina M, Speir JA, Fetrow JS, Baxter SM (2003) Prediction of deleterious functional effects of amino acid mutations using a library of structure-based function descriptors. Proteins 53:806–816.
Hinds DA, Stuve LL, Nilsen GB, Halperin E, Eskin E, Ballinger DG, Frazer KA, Cox DR (2005) Whole-genome patterns of common DNA variation in three human populations. Science 307:1072–1079.
Holden AL (2002) The SNP consortium: summary of a private consortium effort to develop an applied map of the human genome. Biotechniques Suppl 22–24:26.
Hung RJ, Hall J, Brennan P, Boffetta P (2005) Genetic polymorphisms in the base excision repair pathway and cancer risk: a HuGE review. Am J Epidemiol 162:925–942.
International HapMap Consortium (2003) The International HapMap Project. Nature 426:789–796.
Kang D, Hamasaki N (2005) Mitochondrial transcription factor A in the maintenance of mitochondrial DNA: overview of its multiple roles. Ann N Y Acad Sci 1042:101–108.
Karchin R, Diekhans M, Kelly L, Thomas DJ, Pieper U, Eswar N, Haussler D, Sali A (2005) LS-SNP: large-scale annotation of coding non-synonymous SNPs based on multiple information sources. Bioinformatics 21:2814–2820.
Krishnan VG, Westhead DR (2003) A comparative study of machine-learning methods to predict the effects of single nucleotide polymorphisms on protein function. Bioinformatics 19:2199–2209.
Kruglyak L, Nickerson DA (2001) Variation is the spice of life. Nat Genet 27:234–236.
Kyte J, Doolittle RF (1982) A simple method for displaying the hydropathic character of a protein. J Mol Biol 157:105–132.
Lohmueller KE, Pearce CL, Pike M, Lander ES, Hirschhorn JN (2003) Meta-analysis of genetic association studies supports a contribution of common variants to susceptibility to common disease. Nat Genet 33:177–182.
Matthews BW (1995) Studies on protein stability with T4 lysozyme. Adv Protein Chem 46:249–278.
Mooney S (2005) Bioinformatics approaches and resources for single nucleotide polymorphism functional analysis. Brief Bioinform 6:44–56.
Nelson MR, Marnellos G, Kammerer S, Hoyal CR, Shi MM, Cantor CR, Braun A (2004) Large-scale validation of single nucleotide polymorphisms in gene regions. Genome Res 14:1664–1668.
Newton-Cheh C, Hirschhorn JN (2005) Genetic association studies of complex traits: design and analysis issues. Mutat Res 573:54–69.
Ng PC, Henikoff S (2001) Predicting deleterious amino acid substitutions. Genome Res 11:863–874.
Ng PC, Henikoff S (2002) Accounting for human polymorphisms predicted to affect protein function. Genome Res 12:436–446.
Ng PC, Henikoff S (2003) SIFT: Predicting amino acid changes that affect protein function. Nucleic Acids Res 31:3812–3814.
Ramensky V, Bork P, Sunyaev S (2002) Human non-synonymous SNPs: server and survey. Nucleic Acids Res 30:3894–3900.
Reich DE, Gabriel SB, Altshuler D (2003) Quality and completeness of SNP databases. Nat Genet 33:457–458.
Rognes T (2001) ParAlign: a parallel sequence alignment algorithm for rapid and sensitive database searches. Nucleic Acids Res 29:1647–1652.
Rose G, Dato S, Altomare K, Bellizzi D, Garasto S, Greco V, Passarino G, Feraco E, Mari V, Barbi C, BonaFe M, Franceschi C, Tan Q, Boiko S, Yashin AI, De Benedictis G (2003) Variability of the SIRT3 gene, human silent information regulator Sir2 homologue, and survivorship in the elderly. Exp Gerontol 38:1065–1070.
Rudd MF, Williams RD, Webb EL, Schmidt S, Sellick GS, Houlston RS (2005) The predicted impact of coding single nucleotide polymorphisms database. Cancer Epidemiol Biomarkers Prev 14:2598–2604.
Saunders CT, Baker D (2002) Evaluation of structural and evolutionary contributions to deleterious mutation prediction. J Mol Biol 322:891–901.
Savas S, Kim DY, Ahmad MF, Shariff M, Ozcelik H (2004) Identifying functional genetic variants in DNA repair pathway using protein conservation analysis. Cancer Epidemiol Biomarkers Prev 13:801–807.
Shannon CE (1948) A mathematical theory of communication. Bell Syst Tech J 27:379–423, 623–656.
Spitz MR, Wei Q, Dong Q, Amos CI, Wu X (2003) Genetic susceptibility to lung cancer: the role of DNA damage and repair. Cancer Epidemiol Biomarkers Prev 12:689–698.
Stitziel NO, Tseng YY, Pervouchine D, Goddeau D, Kasif S, Liang J (2003) Structural location of disease-associated single-nucleotide polymorphisms. J Mol Biol 327:1021–1030.
Sunyaev S, Ramensky V, Bork P (2000) Towards a structural basis of human non-synonymous single nucleotide polymorphisms. Trends Genet 16:198–200.
Sunyaev S, Ramensky V, Koch I, Lathe W 3rd, Kondrashov AS, Bork P (2001) Prediction of deleterious human alleles. Hum Mol Genet 10:591–597.
Tavtigian SV, Deffenbaugh AM, Yin L, Judkins T, Scholl T, Samollow PB, de Silva D, Zharkikh A, Thomas A (2006) Comprehensive statistical study of 452 BRCA1 missense substitutions with classification of eight recurrent substitutions as neutral. J Med Genet 43:295–305.
Thomas PD, Kejariwal A (2004) Coding single-nucleotide polymorphisms associated with complex vs. Mendelian disease: evolutionary evidence for differences in molecular effects. Proc Natl Acad Sci U S A 101:15398–15403.
Valdar WS (2002) Scoring residue conservation. Proteins 48:227–241.
Voet D, Voet JG (1995) Biochemistry. New York: John Wiley & Sons, Inc.
Wang Z, Moult J (2001) SNPs, protein structure, and disease. Hum Mutat 17:263–270.
Wood RD, Mitchell M, Lindahl T (2005) Human DNA repair genes, 2005. Mutat Res 577:275–283.
Wood RD, Mitchell M, Sgouros J, Lindahl T (2001) Human DNA repair genes. Science 291:1284–1289.
Xi T, Jones IM, Mohrenweiser HW (2004) Many amino acid substitution variants identified in DNA repair genes during human population screenings are predicted to impact protein function. Genomics 83:970–979.
Yue P, Moult J (2006) Identification and analysis of deleterious human SNPs. J Mol Biol 356:1263–1274.
Zhu Y, Spitz MR, Amos CI, Lin J, Schabath MB, Wu X (2004) An evolutionary perspective on single-nucleotide polymorphism screening in molecular cancer epidemiology. Cancer Res 64:2251–2257.
(Accepted 12 September 2006) (Available online 19 October 2006)