Gene 246 (2000) 239–245 www.elsevier.com/locate/gene
Gene conversion among chemokine receptors
Denis C. Shields *
Department of Clinical Pharmacology, Royal College of Surgeons in Ireland, Dublin 2, Ireland
Received 19 August 1999; received in revised form 27 December 1999; accepted 4 February 2000
Received by O. Clay
Abstract
It has been proposed that proteins which are involved in host defence and susceptibility undergo accelerated evolution. Chemokine receptors have roles as pro-inflammatory agents acting in response to infection, and in addition are receptors for entry of viruses and other pathogens into cells. Consistent with this, their rate of evolution is higher than that for other members of the seven-transmembrane domain receptor family. The pattern of evolution of the chemokine receptors was examined in detail. Both chromosomal clusters of chemokine receptors (CC and CXC) showed evidence of a number of gene conversions. These are likely to have resulted in protein sequence changes, which could possibly alter function. Of a control group of clustered genes, 45% also showed evidence of conversion. Thus, the fixation of a gene conversion is not in itself unusual in tandemly repeated genes and cannot be taken as strong evidence of selection for a novel function. However, the degree of amino acid difference between the chemokine receptors CCR1 and CCR3 was greater than that for any of the control genes. Such changes could have functional implications for inter-species differences in chemokine receptor interactions with pathogens. © 2000 Elsevier Science B.V. All rights reserved.
Keywords: Adaptive evolution; Host–pathogen coevolution
1. Introduction
Plants lack a specific immune response, so that the emergence of a novel variant in a pathogen gene is frequently followed by selection for a novel variant in a plant host gene that spreads as a resistance factor through the population, leading to the observed gene-for-gene relationship (Keen, 1990). The specific immune system of mammals buffers the non-specific immune system from many pathogen challenges. Therefore, fewer changes of the mammalian non-specific immune system are likely during evolution. However, there are numerous human polymorphisms that are associated with resistance to malaria and other infections (Hill and Motulsky, 1999), indicating that there are still evolutionary pressures on mammalian host components in response to pathogen pressures (Marrack and Kappler, 1994). It has been proposed (Murphy, 1993) that the high rate of evolution of host defence components
Abbreviations: CR, chemokine receptor; CCR, CC chemokine receptor; CXCR, CXC chemokine receptor. * Tel.: +353-1-4022381; fax: +353-1-4022388. E-mail address:
[email protected] (D.C. Shields)
reflects positive selection for novel host variants which confer an advantage with respect to particular pathogen challenges arising at particular times during evolution. However, a high evolutionary rate in a protein may just as well be associated with a low level of functional constraint. Extreme rates of protein change driven by positive selection will result in an excess of non-synonymous over synonymous changes when species are compared. Some host defence proteins, such as the eosinophilic ribonucleases (Rosenberg et al., 1995) and the hemopoietin family of cytokines (Shields et al., 1996), do in fact show such excesses. This suggests that their high rates of evolution result partly from positive selection of novel adaptive variants. However, the evolutionary rates of many host defence proteins are not so rapid that non-synonymous rates exceed synonymous rates, yet adaptive changes may still be occurring. Another clue that indicates adaptive evolution is acceleration in an evolutionary lineage which is consistent with adaptive changes. While the rate of nearly neutral changes in a protein should be approximately constant over time, it would be predicted that adaptive changes will occur in bursts, often in one lineage and not in the other.
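The contrast between non-synonymous and synonymous changes described above can be made concrete with a small sketch. It is an illustration only and is not an analysis performed in this study: the function name `count_changes` and the codon-by-codon classification are assumptions introduced here, and a proper comparison would normalise the counts by the numbers of synonymous and non-synonymous sites rather than compare raw counts.

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def count_changes(seq_a, seq_b):
    """Count differing codons as synonymous or non-synonymous.

    Both inputs are assumed to be aligned, in-frame coding sequences.
    Codons containing gaps or ambiguous bases are skipped.
    """
    syn = nonsyn = 0
    for i in range(0, min(len(seq_a), len(seq_b)) - 2, 3):
        codon_a = seq_a[i:i + 3].upper()
        codon_b = seq_b[i:i + 3].upper()
        if codon_a == codon_b:
            continue
        if codon_a not in CODON_TABLE or codon_b not in CODON_TABLE:
            continue
        if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            syn += 1      # same amino acid encoded by both codons
        else:
            nonsyn += 1   # amino acid replaced
    return syn, nonsyn
```

For example, `count_changes("ATGTTTCTA", "ATGTTCCGA")` classifies TTT/TTC (both Phe) as synonymous and CTA/CGA (Leu/Arg) as non-synonymous.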
0378-1119/00/$ - see front matter © 2000 Elsevier Science B.V. All rights reserved. PII: S0378-1119(00)00072-X
D.C. Shields / Gene 246 (2000) 239–245
The chemokine receptors (CRs) CCR5 and Duffy are known to be receptors for HIV and vivax malaria, respectively (Fauci, 1996). Population null variants of CCR5 occur in both human and sooty mangabey populations, leading to the hypothesis that transient selection pressures have acted strongly on these proteins in the recent evolutionary past (Palacios et al., 1998). CRs additionally mediate the signalling of host immune responses, and are pirated by herpesviral genomes. Their rates of evolution appear faster than those of other seven-transmembrane domain receptors (Shields et al., 1996). In this study, we investigate whether the pathogen-related roles may have influenced their mode of evolution, by examining the evolution of the group of chemokine receptors (CCRs) specific for ‘CC’ chemokines, the related ‘CXC’ chemokine receptors (CXCRs), and a number of related receptors.
2. Materials and methods
The gene names used here are those of the CCR and CXCR numbering system; where a CR has not been assigned any such name (usually because its ligand-binding specificity is unknown), the SWISSPROT database ID has been used instead. The mouse SwissProt entry CKRY_MOUSE has been termed CCR2_MOUSE, on the basis of phylogenetic similarity with the human CCR2 protein. For human CCR2, the alternative splice variant has been chosen which confers greater similarity to the mouse CCR2 sequence. CRs from viral genomes have been designated simply by the virus name, if there is only one, or following an existing convention if there is more than one CR-related sequence within the genome. Sequences were obtained from the EMBL DNA and SwissProt protein databases: accession Nos are given in Table 2. Sequences of chemokine receptors (CR) were aligned and neighbour-joining trees (Saitou and Nei, 1987) drawn using CLUSTALW (Thompson et al., 1994); default settings of the program were used. To compare evolutionary rates in different CRs, percentage amino acid differences between man and rodent were calculated for the sequences in Fig. 1, excluding positions with gaps in any sequence. While it is more conventional to convert amino acid difference to the inferred number of amino acid replacements, this has not been carried out here, since such corrections are unreliable for the viral comparisons, and this conversion does not materially affect the evolutionary inferences under consideration.

Fig. 1. The tree of host chemokine receptors was constructed using the CLUSTALW (Thompson et al., 1994) implementation of the neighbour-joining method (Saitou and Nei, 1987). The tree is arbitrarily rooted with Duffy as an outgroup, although this is not known to be a true root. Branch lengths are proportional to the inferred number of amino acid replacements. Given the occurrence of gene conversions and low bootstrap support for many branchings, this tree is only a guide to the evolution of host proteins.

2.1. Gene conversion
No satisfactory method exists which can reliably reconstruct the evolutionary history of a group of proteins, allowing for insertions and deletions (the alignment), the branching order of the genes (the phylogeny), and for gene conversion events. The following approach was carried out to seek evidence for the most obvious gene conversion events among clustered genes. DNA alignments were made for the host proteins based on the protein alignments. Sawyer’s VTDIST3 program (Sawyer, 1989) was used to detect identical segments of DNA between clustered genes within a species which may indicate gene conversion events. All DNA sites were used in the analysis, to maximise power to detect conversion events (assuming reasonable independence of adjacent replacements). Significance of the longest identical segments was determined by the segment length distributions found in 1000 simulations of permuted polymorphic sites, and an arbitrary cut-off P value of 0.01 was taken to reflect a significant gene conversion. The default settings of the program were used, except as indicated above. The first set of analyses considered only conversion events within a species, where the
analysis included the following two sets of clustered genes: CCR1, CCR2, CCR3, CCR4 and CCR5 from human and mouse; and CXCR1 and CXCR2 from human, rabbit and rat. Since Sawyer’s method detected identical stretches of DNA, visual inspection allowed their extension to include highly similar regions contiguous with the regions identified initially by the program. Clearly, a much larger analysis could consider all the human/rodent chemokine receptor orthologues and carry out tests of conversion between all of them. However, the statistical power of the study would rapidly vanish in the light of the number of tests. For this reason, the study concentrated on identified gene clusters where the combination of physical proximity and sequence similarity makes them likely substrates for gene conversion in the first instance.

To determine the incidence of gene conversion in the genome, a sample of tandem genes which have apparent homology between man and rodent was prepared by identifying genes on the Mouse Genome Database (Mouse Genome Database, 1997) list of mouse–man homologues which occur within 1 cM on the mouse genetic map, and where the two human genes were more than 50% identical along their protein sequences, obtained from the SWISSPROT 34 database. This small sample of genes should give a reasonable indication of the frequency of gene conversion events. If one of the human genes was most similar to one of the rodent proteins, and the other to the adjacent rodent protein, these genes were then included for study, as it is likely that gene duplication occurred prior to speciation. This excluded some tandemly clustered genes where gene conversion is known to occur, and is thus likely to underestimate the true frequency of gene conversion. For example, the MDR1 and 3 genes in mouse are closely related, very probably through gene conversion, and were excluded from the study: the guinea-pig genes were used instead (Table 2). A total of 20 such gene clusters were found.
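The permutation logic described in Section 2.1 can be sketched as follows. This is a simplified illustration written for this purpose and is not Sawyer's VTDIST3 implementation: the function names, the choice to measure segment length in polymorphic sites rather than alignment positions, and the plain fraction used as the P value are all assumptions made here. The alignment is assumed to be a list of equal-length DNA strings with `-` marking gaps.

```python
import random


def longest_run(flags):
    """Length of the longest run of True values in a sequence of booleans."""
    best = current = 0
    for flag in flags:
        current = current + 1 if flag else 0
        best = max(best, current)
    return best


def conversion_p_value(alignment, i, j, n_perm=1000, seed=0):
    """Permutation P value for the longest stretch shared by rows i and j.

    Returns (observed longest run, fraction of permutations in which the
    longest run is at least as long as the observed one).
    """
    # Keep ungapped columns that vary among the aligned sequences.
    columns = [col for col in zip(*alignment)
               if "-" not in col and len(set(col)) > 1]
    # True where the focal pair agree at a polymorphic site.
    match = [col[i] == col[j] for col in columns]
    observed = longest_run(match)

    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        shuffled = match[:]
        rng.shuffle(shuffled)  # permute the order of polymorphic sites
        if longest_run(shuffled) >= observed:
            hits += 1
    return observed, hits / n_perm
```

Under this scheme a pair of genes is flagged when matching sites are clustered more tightly than expected if sites were exchangeable along the sequence; with 1000 permutations the smallest non-zero P value that can be reported is 0.001.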
3. Results and discussion
3.1. Rates of amino acid replacement
The tree of seven-transmembrane receptors with broad similarity to the chemokine receptors in Fig. 1 is not a perfect representation of the true branching order. Firstly, gene conversion among sequences complicates the pattern (see below); secondly, there is low statistical support (not shown) for many branches, particularly those that are separated only by small distances on the tree. However, the broad features of the tree are of relevance to the question of how host CRs are evolving in response to pathogen pressures. There is considerable variation in the rate of evolution of the different proteins (Table 1). Thus, CXCR4 is evolving slowly (6.5% amino acid differences between mouse and human), while the other CRs evolve faster, with Duffy showing 35.5% amino acid differences. It might be predicted that more changes would occur in the extracellular regions, if they were responding to pathogen pressures. Intracellular regions showed a mean of 16.0% amino acid difference (Table 1), while extracellular regions showed a higher rate of evolution, with a mean of 33.7% difference (ranging from 13.2 to 62.3%). This could be taken as evidence for adaptive evolution of the extracellular domains, but it can be just as easily argued that they are less constrained by functional requirements such as those relating to intracellular signalling. The more rapidly evolving CRs do not, however, generally have their excess changes concentrated in the extracellular regions. One way of quantifying this is by following the increase in per cent difference in extracellular and intracellular regions with increasing difference in the transmembrane regions. From a regression analysis, the slope of the line is approximately 1 for both: the slope of the regression of per cent difference in extracellular regions on that in transmembrane regions is 1.19 (95% confidence interval 0.2–2.2). Very similar results are obtained in the analysis of the intracellular regions (slope of 1.19, 95% confidence interval 0.5–1.9). Only CCR3 shows an unusually high number of extracellular changes (62.3%) compared to intracellular (7.0%) (Table 1). These results indicate that, overall, there is no statistically significant trend towards acceleration in extracellular domains compared with other domains within a subset of proteins. Therefore, the difference seen in CCR3 only weakly suggests a difference consistent with positive selection on extracellular domains alone.

Table 1
Percentage evolutionary differences between human and mouse chemokine receptors

          % Amino acid difference
Name      Total (S.E.)   Extracellular domains   Intracellular domains
CXCR4     6.5 (1.5)      22.6                    3.5
CCR7      11.2 (2.0)     26.4                    8.8
CCR4      11.9 (2.0)     13.2                    14.0
BLR1      11.9 (2.0)     34.0                    3.5
GPRD      15.0 (2.2)     22.6                    14.0
CCR2      15.4 (2.2)     34.0                    7.0
CCR5      15.8 (2.3)     18.9                    19.3
CCR1      18.5 (2.4)     22.6                    12.3
CXCR2     23.1 (2.6)     37.7                    17.5
CCR3      26.5 (2.7)     62.3                    7.0
CCR10     28.1 (2.8)     47.2                    33.3
CXCR1     31.2 (2.9)     50.9                    22.8
Duffy     35.5 (3.0)     45.3                    43.9

Gene conversion(a): the entries of this column could not be assigned to rows in the extracted text (surviving entries: – – N – – – – –).
(a) See Fig. 2.

3.2. Gene conversions within host chemokine receptors have changed amino acid sequences during evolution
The two rat CXCR1 and CXCR2 proteins are more divergent than the same proteins from other species, as indicated by their separation on the tree (Fig. 1). There are four possible explanations for this. (i) The rat genes have been evolving more rapidly, which is unlikely, since their distance from the outgroup CCR sequences is not increased. (ii) The rat genes are not orthologous with the human and rabbit pairs of genes. This is unlikely, since Southern blotting of genomic DNA with CXCR probes is consistent with a maximum of two CXCR1 and CXCR2 related genes in rat, human and mouse (Dunstan et al., 1996). (iii) Distances may be longer simply by chance. (iv) The rabbit and human genes have undergone extensive gene conversion, but the rat genes have not. Close inspection suggests that the last explanation is correct. Fig. 2 shows apparent gene conversion events that have occurred within species in the CXCR1/2 gene cluster.
In rabbit and man, there is evidence of extensive gene conversions of up to 100 amino acids, but there are no indications that the rat sequences have been subjected to conversion. Almost half the length of the rabbit genes has been homogenised, from the centre of the second extracellular loop to the start of the cytoplasmic tail. The human CXCRs are homogenised from the centre of the first transmembrane domain to the end of the first intracellular loop: the direction of conversion is uncertain. It is possible that more gene conversions than those indicated in Fig. 2 have occurred at more remote evolutionary periods.

The CC chemokine receptors (CCRs) also show evidence of gene conversions. A region of mouse CCR3 spanning the last extracellular loop and the seventh transmembrane region has clearly been converted by DNA originating from murine CCR1. The mouse MIP-1 receptor, designated CCR2 here on the grounds of its close similarity to human CCR2 (Fig. 1), has exchanged extensive stretches of sequence with mouse CCR5, spanning the first extracellular loop and covering the N terminus of the second intracellular loop (Fig. 2); this converted region is closer to human CCR2, suggesting that murine CCR5 was converted by DNA derived from CCR2. Another stretch of murine CCR2 and CCR5 has been homogenised spanning the third intracellular loop. Among the human CCR genes, CCR2 and CCR5 show evidence of a gene conversion overlapping the C-terminus of the second intracellular loop. In general, within the chromosomal cluster, genes exchange DNA with the genes with which they group most closely in the phylogenetic tree. Thus, CCR4, which is more distantly related, shows no evidence of conversion events. CCR5 and CCR2 are more closely related, and are exchanging DNA; CCR1 and CCR3 are closely related and are exchanging DNA. Almost all of the receptor sequence has been subject to gene conversions; only the ends, and part of the fourth transmembrane domain, appear to have escaped unscathed.

While there are many instances of apparent gene conversions in various genes documented in the literature, it is difficult to assess their frequency from this source. The incidence of gene conversion was estimated in a sample of clustered homologues that have apparent homology between man and rodent (Table 2). Of the 20 gene clusters, nine showed evidence of gene conversion (at least one event with P<0.01). Thus, evidence for gene conversion among clustered homologues is not in itself unusual. Therefore, there is no proof that the gene conversions have arisen through selection, resulting from pathogen homologue pressures, or any other source. A valid criticism of the approach taken here is that the method relied on all sites, and did not restrict analysis to synonymously variable sites.
This is justified for the chemokine receptors, which are too short to provide statistically significant evidence for conversion based on synonymous sites alone. It is conceivable that the observed effects are not a consequence of conversion, but instead reflect a high degree of constraint within one species on a region, which is allowed to diverge in another species. While this conclusion would be of great biological significance in understanding the biological constraints and inter-species differences for these proteins, it appears very unlikely. Clustered genes which have undergone gene conversions have a greater amino acid similarity than pairs which show no evidence of conversion (Table 2). While this may in part reflect the homogenising influence of conversion, it probably mainly reflects the increased rate of conversion between more closely related DNA. It is worth noting that the mouse CCR1/CCR3 conversion between these two quite distantly related proteins is likely to have resulted in more amino acid changes than the typical gene conversion between clustered genes
Fig. 2. Numbering is shown for the human CCR1 protein. Only parts of the N-terminal extracellular and C-terminal cytoplasmic domains are shown. Transmembrane domains are shown in uppercase letters. Gene conversion events judged to be significant (P=0.01) stretches of DNA identity are indicated in italics. Underlined regions which are not in italics indicate adjoining sequences of strong DNA similarity, which may have accumulated a few point mutations since a conversion event. The right-hand column in bold indicates the pairs of sequences that have undergone conversions: A, human CXCR1 and CXCR2; B, mouse CCR2 and CCR5; C, rabbit CXCR1 and CXCR2; D, human CCR2 and CCR5; E, mouse CCR1 and CCR3.
Table 2
Gene conversion in homologous gene clusters in man and rodent

SwissProt ID(a)                      Evidence of conversion(b)   AA difference(c)   Protein function
MYSA(d)/MYSB(d)                      Yes (P=0.000)               0.07               Myosin cardiac α and β
CIN1(d)/CIN2(d)                      Yes (P=0.001)               0.09               Sodium channel protein
CDN2(e)/CDN5(d)                      Yes (P=0.003)               0.18               Kinase inhibitor
IL8A(d)/IL8B(d)                      Yes (P=0.000)               0.24               CXCR chemokine receptors
CRGA(d)/CRGB(d)/CRGC(d)/CRCD(d)      Yes (P=0.000)               0.28               Gamma crystallins
NEU1(e)/NEU2(e)                      Yes (P=0.000)               0.28               Oxytocin/vasopressin
MCPI(e)/EOTA(e)                      No (P=0.43)                 0.29               Chemokines
LDHM(e)/LDHX(e)                      No (P=0.03)                 0.31               Lactate dehydrogenase
MDR1(f)/MDR3(f)                      Yes (P=0.000)               0.33               Multidrug resistance
CP11(e)/CP12(e)                      Yes (P=0.000)               0.34               Cytochromes
CIK1(d)/CIK5(d)                      Yes (P=0.002)               0.35               Potassium channel protein
GAR1(d)/GAR2(d)                      Yes (P=0.008)               0.35               GABA(A) receptor
GBAK(d)/GBT2(e)                      No (P=0.91)                 0.38               G-proteins
CAH1(e)/CAH2(e)                      No (P=0.85)                 0.40               Carbonic anhydrase
CRB1(d)/CRB3(d)                      No (P=0.60)                 0.41               Beta crystallins 1 and 3
CATG(e)/GRAB(e)                      No (P=0.73)                 0.42               Serine proteases
CAD1(e)/CAD3(e)                      No (P=0.16)                 0.42               Cadherin
GB11(e)/GB15(e)                      No (P=0.64)                 0.43               G-proteins
SAA1(e)/SAA4(e)                      No (P=0.29)                 0.43               Serum amyloid A
CKR1(e)/CKR3(e)/CKR4(e)/CKR5(e)      Yes (P=0.000)               0.46               CCR chemokine receptors
GBI1(e)/GBT1(e)                      No (P=0.46)                 0.46               G-proteins
CRP(e)/SAMP(e)                       No (P=0.98)                 0.48               Acute-phase reactants

(a) The sequences were derived from SWISSPROT entries carrying the ID given suffixed with _HUMAN for the human genes and _MOUSE for rodent genes, except where other species are indicated.
(b) P-values are for the most significant converted segment between the DNA sequences of either the two human genes or the two mouse genes.
(c) The proportion of amino acid differences between the two human proteins. For clusters with more than two genes, this is shown for the first two genes only.
(d) Rodent sequences from rat.
(e) Rodent sequences from mouse.
(f) ‘Rodent’ sequences from guinea pig.
(Table 2, amino acid identity=0.45); the corresponding human gene regions differ by seven amino acids. It is possible that this in itself indicates a conversion event which is quite unlikely to have initially occurred, since the converting and replaced DNA are unlikely to have been highly similar. Adaptive selection could possibly have brought such a rare mutation to fixation. Regardless of whether it is adaptive, it is still more likely to have functional consequences than other conversions, since it is likely to have led to a number of amino acid changes. CCR3 and CCR1 share certain chemokines as ligands, but CCR3 is a specific receptor for eotaxin, and CCR1 is a specific receptor for MIP-1α. Experimentally, ligand-specificity determinants have been mapped to the third extracellular domain of CRs; the CCR3 third extracellular domain has been shown to confer some eotaxin-binding specificity when inserted into CCR1 as a chimera (Pease et al., 1998). If there is any adaptive significance to the gene conversion of the mouse sequences, it may reflect mouse CCR3 becoming more like CCR1 in its ligand-specificity. The approach used here in investigating gene conversion has a number of limitations: it assumes equal rates of nucleotide substitution, equal mutation rates among species, and lack of convergent selection for shared function among amino acid residues carried at homologous sites on different proteins. The approach has also limited itself to CRs which are in known gene clusters. Nevertheless, visual inspection of the apparent conversions (Fig. 2) does indicate that gene conversion is likely to be taking place.
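The control-cluster figures quoted in this section can be checked directly from Table 2. The sketch below is a worked illustration under stated assumptions: it takes the 20 control clusters to be the Table 2 rows other than the two chemokine receptor clusters (an inference that reproduces the nine-of-twenty count), the tuple layout and variable names are introduced here, and the Wilson interval is added only to show the sampling uncertainty of a proportion from 20 clusters; it is not a calculation reported in the study.

```python
from math import sqrt

# (cluster, conversion at P<0.01, AA difference, chemokine receptor cluster)
TABLE2 = [
    ("MYSA/MYSB", True, 0.07, False), ("CIN1/CIN2", True, 0.09, False),
    ("CDN2/CDN5", True, 0.18, False), ("IL8A/IL8B", True, 0.24, True),
    ("CRGA-CRCD", True, 0.28, False), ("NEU1/NEU2", True, 0.28, False),
    ("MCPI/EOTA", False, 0.29, False), ("LDHM/LDHX", False, 0.31, False),
    ("MDR1/MDR3", True, 0.33, False), ("CP11/CP12", True, 0.34, False),
    ("CIK1/CIK5", True, 0.35, False), ("GAR1/GAR2", True, 0.35, False),
    ("GBAK/GBT2", False, 0.38, False), ("CAH1/CAH2", False, 0.40, False),
    ("CRB1/CRB3", False, 0.41, False), ("CATG/GRAB", False, 0.42, False),
    ("CAD1/CAD3", False, 0.42, False), ("GB11/GB15", False, 0.43, False),
    ("SAA1/SAA4", False, 0.43, False), ("CKR1-CKR5", True, 0.46, True),
    ("GBI1/GBT1", False, 0.46, False), ("CRP/SAMP", False, 0.48, False),
]


def mean(values):
    return sum(values) / len(values)


def wilson_interval(k, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion k/n."""
    p = k / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Assumption: controls are all rows except the chemokine receptor clusters.
controls = [row for row in TABLE2 if not row[3]]
converted = [row for row in controls if row[1]]
unconverted = [row for row in controls if not row[1]]

fraction = len(converted) / len(controls)
low, high = wilson_interval(len(converted), len(controls))
mean_converted = mean([row[2] for row in converted])
mean_unconverted = mean([row[2] for row in unconverted])
```

With these inputs the converted control clusters have a mean amino acid difference of about 0.25 against about 0.40 for the unconverted ones, in line with the observation that converted pairs are the more similar, and the interval around 45% is wide, as expected for a sample of 20.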
4. Conclusions
The approach used here in investigating gene conversion has a number of limitations: it assumes equal rates of nucleotide substitution, equal mutation rates among species, and lack of convergent selection for shared function among amino acid residues carried at homologous sites on different proteins. The approach has also limited itself to CRs which are in known gene clusters. Many of these limitations have been imposed in order to ensure that the study has sufficient power to detect biologically important events. In spite of these limitations, visual inspection of the apparent conversions (Fig. 2) strongly suggests that gene conversion is taking place.
Gene conversions among CRs that are chromosomally clustered have resulted in extensive and abrupt changes in amino acid sequences during evolution. Gene conversions within the MHC gene complex (Parham and Ohta, 1996) and among cytochromes (Gonzalez and Nebert, 1990) have been proposed to be adaptive. Gene conversion offers a much more rapid means of generating biologically advantageous variants than does point mutation, since the sub-sequences that are recombined have already proven their biological worth within a different context. In contrast, point mutations may provide a less reliable means of producing a useful new phenotype, since the majority of point changes will be deleterious. However, the conversion events are shown here to occur in 45% of control gene clusters. Thus, evidence of a conversion event since the rodent–human divergence may well be dominated by the comparatively high rate of neutral conversion events, of no functional importance, not only among chemokine receptors, but also for cytochromes and MHC proteins. Even though its adaptive significance is difficult to assess against a background of frequent conversions, the conversion of CCR1 and CCR3, despite their extensive overall divergence, suggests an appreciable change in amino acid sequence. This evidence, as well as the identification of an excess of extracellular changes in CCR3 in comparison with other proteins, may be consistent with adaptive pressures on CCR3 function. Gene conversions have consequences for the interpretations of species differences in any processes involving the chemokine receptor clusters. In particular, extensive differences in pathogen receptor amino acid sequences may result in important differences among animal hosts in disease progression, and these must be taken into account when interpreting the results of experimental animal models of infection where clustered CRs are known receptors for the pathogen, such as HIV.
Acknowledgements
This work was supported by the Higher Education Authority (Ireland) and the Wellcome Trust (039618/Z/93/Z). I thank Andrew Lloyd and Ken Wolfe for discussion.
References
Dunstan, C.A.N., Salafranca, M.N., Adhikari, S., Xia, Y., Feng, L., Harrison, J.K., 1996. Identification of two rat genes orthologous to the human interleukin-8 receptors. J. Biol. Chem. 271, 32770–32776.
Fauci, A.S., 1996. Host factors and the pathogenesis of HIV-induced disease. Nature 384, 529–534.
Gonzalez, F.J., Nebert, D.W., 1990. Evolution of the P450 gene superfamily: animal–plant ‘warfare’, molecular drive and human genetic differences in drug oxidation. Trends Genet. 6, 182–186.
Hill, A.V.S., Motulsky, A.G., 1999. Natural selection for disease susceptibility and resistance genes: examples and prospects. In: Stearns, S.C. (Ed.), Evolution in Health and Disease. Oxford University Press, Oxford.
Keen, N.T., 1990. Gene-for-gene complementarity in plant–pathogen interactions. Annu. Rev. Genet. 24, 447–463.
Marrack, P., Kappler, J., 1994. Subversion of the immune system by pathogens. Cell 76, 323–332.
Mouse Genome Database, 1997. Mouse Genome Informatics, The Jackson Laboratory, Bar Harbor, ME. World Wide Web (URL: http://www.informatics.jax.org/).
Murphy, P.M., 1993. Molecular mimicry and the generation of host defense protein diversity. Cell 72, 823–826.
Palacios, E., Digilio, L., McClure, H.M., Chen, Z., Marx, P.A., Goldsmith, M.A., Grant, R.M., 1998. Parallel evolution of CCR5-null phenotypes in humans and in a natural host of simian immunodeficiency virus. Curr. Biol. 13, 943–946.
Parham, P., Ohta, T., 1996. Population biology of antigen presentation by MHC class I molecules. Science 272, 67–74.
Pease, J.E., Wang, J., Ponath, P.D., Murphy, P.M., 1998. The N-terminal extracellular segments of the chemokine receptors CCR1 and CCR3 are determinants for MIP-1α and eotaxin binding, respectively, but a second domain is essential for efficient receptor activation. J. Biol. Chem. 273, 19972–19976.
Rosenberg, H.F., Dyer, K.D., Tiffany, H.L., Gonzalez, M., 1995. Rapid evolution of a unique family of primate ribonuclease genes. Nat. Genet. 10, 219–223.
Saitou, N., Nei, M., 1987. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4, 406–425.
Sawyer, S., 1989. Statistical tests for detecting gene conversion. Mol. Biol. Evol. 6, 526–538.
Shields, D.C., Harmon, D., Whitehead, A.S., 1996. Evolution of hemopoietic ligands and their receptors: influence of positive selection on correlated replacements throughout ligand and receptor proteins. J. Immunol. 156, 1062–1070.
Thompson, J.D., Higgins, D.G., Gibson, T.J., 1994. Clustal W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22, 4673–4680.