Modeling amino acid substitution patterns in orthologous and paralogous genes

Modeling amino acid substitution patterns in orthologous and paralogous genes

Molecular Phylogenetics and Evolution 42 (2007) 298–307 www.elsevier.com/locate/ympev Modeling amino acid substitution patterns in orthologous and pa...

241KB Sizes 6 Downloads 100 Views

Molecular Phylogenetics and Evolution 42 (2007) 298–307 www.elsevier.com/locate/ympev

Modeling amino acid substitution patterns in orthologous and paralogous genes Gavin C. Conant

a,*

, Gu¨nter P. Wagner b, Peter F. Stadler

c

a Smurfit Institute of Genetics, Trinity College, University of Dublin, Dublin 2, Ireland Department of Ecology and Evolutionary Biology, Yale University, POB 208106, New Haven CT, 06520, USA Lehrstuhl fu¨r Bioinformatik, Institut fu¨r Informatik, Universita¨t Leipzig, Haertelstrasse 16-18, D-04107 Leipzig, Germany b

c

Received 1 February 2006; revised 12 June 2006; accepted 6 July 2006 Available online 26 July 2006

Abstract We study to what degree patterns of amino acid substitution vary between genes using two models of protein-coding gene evolution. The first divides the amino acids into groups, with one substitution rate for pairs of residues in the same group and a second for those in differing groups. Unlike previous applications of this model, the groups themselves are estimated from data by simulated annealing. The second model makes substitution rates a function of the physical and chemical similarity between two residues. Because we model the evolution of coding DNA sequences as opposed to protein sequences, artifacts arising from the differing numbers of nucleotide substitutions required to bring about various amino acid substitutions are avoided. Using 10 alignments of related sequences (five of orthologous genes and five gene families), we do find differences in substitution patterns. We also find that, although patterns of amino acid substitution vary temporally within the history of a gene, variation is not greater in paralogous than in orthologous genes. Improved understanding of such gene-specific variation in substitution patterns may have implications for applications such as sequence alignment and phylogenetic inference.  2006 Elsevier Inc. All rights reserved. Keywords: Amino acid substitution; Evolutionary models; Protein evolution

1. Introduction One aspect of the evolution of protein-coding genes that is still imperfectly understood is at what frequency the various amino acid residues exchange with each other over time. It is clear from the differing chemistries of the amino acids and from functional studies (for example the observation that proteolytic enzyme trypsin can be converted to the substrate specificity of chymotrypsin by a single amino acid change, Hedstrom et al., 1994) that the different residues are not generally evolutionarily equivalent (Thorne et al., 1996). It may thus be fruitful to pursue models of evolution incorporating varying rates of amino acid substitution. Such models could benefit at least two areas in the

study of molecular evolution. First, correctly modeling substitution rates may be helpful for problems such as sequence alignment (where estimates of relative rates of amino acid substitution (RRAAS)1 provide alignment scores) and phylogenetic inference. Second, understanding these patterns should help us understand the factors that shape and constrain protein evolution. The most straight-forward approach to creating these models is an empirical one: determining substitution frequencies using a representative sample of protein sequences. This approach has been very successful when applied to the problem of sequence alignment. The two estimates in general use are PAM Dayhoff et al. (1972, 1978) and 1

*

Corresponding author. Fax: +353 1 679 8558. E-mail address: [email protected] (G.C. Conant).

1055-7903/$ - see front matter  2006 Elsevier Inc. All rights reserved. doi:10.1016/j.ympev.2006.07.006

Abbreviations used: RRAAS, relative rate of amino acid substitution; SG, similarity groups model; LCAP, linear combination of amino acid properties model.

G.C. Conant et al. / Molecular Phylogenetics and Evolution 42 (2007) 298–307

BLOSUM Henikoff and Henikoff (1992). The latter is the standard scoring matrix for packages such as BLAST (Altschul et al., 1990, 1997) and CLUSTALW (Thompson et al., 1994). The PAM and BLOSUM matrices consist of estimates of the 190 distinct pairwise amino acid substitution rates (the substitutions A fi B and B fi A cannot be distinguished with this approach). Because estimating such a large number of parameters requires a correspondingly large sequence database, any distinctive patterns of evolution in individual proteins are necessarily lost in the inherent averaging process (Thorne et al., 1996). Because of the large statistical uncertainty associated with estimating numerous parameters from small datasets such as single protein alignments, the approaches used to create PAM and BLOSUM matrices are less than optimal for studying differences in amino acid substitution rates between proteins (Koshi and Goldstein, 1998). Whether any differences in residue substitution rates in fact exist has been debated. Graur (1985) argues that most variation in protein evolution is a result of the differing amino acid composition of proteins and that constraints on amino acid properties are generally homogenous across genes. Tourasse and Li (2000), however, have found that observed substitution patterns result from the unique functional characteristics of individual protein families. In addition to the difficulty with estimating RRAAS matrices from small samples already mentioned, the empirical approach also suffers from the fact that features of the genetic code are not explicitly modeled. For instance, if the entry in such a matrix is large and negative (indicating a low substitution rate between the two residues in question), it is not clear if that is due to a rarity of mutations allowing that substitution (for two residues which differ at more than one position in the genetic code, for instance) or natural selection against such changes. For these reasons, we have not pursued an empirical approach here but instead use two models of molecular evolution which incorporate parameters to differentiate among the amino acid substitutions. Because these two models have many fewer parameters than do complete RRAAS matrices, they can be applied to much smaller datasets and yield parameters with arguably more intuitive interpretations. They also incorporate the genetic code, avoiding the potential confound mentioned above. 1.1. Models of evolution 1.1.1. Model 1: Similarity groups We refer to our first model as the similarity groups (SG) model. The principle of this model was developed by Hughes et al. (1990) as a test for directional selection. These authors divided the 20 amino acids into groups, such that the members of each group were more similar to each other than to the residues in other groups. The model further assumes two distinct rates of amino acid substitutions.

299

Conservative amino acid substitutions, where the two residues in question are similar (as defined by being in the same group), occur at rate Rc, which is generally higher that the radical substitution rate Rr (substitutions where the two residues are not in the same group). To infer selective constraint Hughes et al. then compared the rate of conservative substitutions (Rc) to radical substitutions (Rr). This approach has the advantage of generality: one can vary both how the amino acids are divided and the number of groups used. For instance, Zhang (2000) performed several analyses using two different group definitions: one of two groups (polar residues and non-polar ones) and one of six groups (inferred from several properties). Our approach follows the molecular evolution package HYPHY (Kosakovsky Pond et al., 2005) in making these two rates parameters in a model of evolution. Unlike HYPHY, we have also allowed the group definitions for an alignment to be estimated using simulated annealing (Kirkpatrick et al., 1983), an optimization approach which has been previously applied to problems such as phylogenetic inference (Salter and Pearl, 2001). Thus, for two codons A and B separated by a single nucleotide substitution, the instantaneous codon substitution rate RA,B is given by 8 > < I w  Rnucleotide ðA; BÞ If A and B code for different residues in the same group RA;B ¼ I b  Rnucleotide ðA; BÞ If A and B code for different residues in different groups > : Rnucleotide ðA;BÞ If A and B are synonymous codons ð1Þ

Here, Ib and Iw are instantaneous relative substitution rates whose ratio (Iratio = Ib/Iw) has an interpretation similar to that of Hughes et al.’s Rr/Rc. Rnucleotide gives the instantaneous nucleotide substitution rate between A and B, which depends on the model of nucleotide substitution used (for instance, the HKY 1985 model, used for all analyses here, Hasegawa et al., 1985). 1.1.2. Model 2: Linear combination of amino acid properties Our second model derives from a codon model developed by Goldman and Yang (1994). To understand these authors’ model, it is necessary to consider the properties of the various amino acid residues. Relevant properties include: (1) volume of the residue side chain, which varies from three cubic Angstroms for glycine to 170 for tryptophan. (2) Iso-electric point of the side chain measured in units of pH. We use measurements from Haig and Hurst (1991) which vary from 2.8 for aspartic acid to 10.8 for arginine. (3) Residue polarity, delimited on a relative scale from Grantham (1974) and ranging from 13 for aspartic acid to 4.9 for leucine. (4) Chemical composition of the side chain, a measurement from Grantham giving the ratio of the atomic weight of the side-chain non-carbon atoms to the weight of the side-chain carbon atoms. This measure varies from 0 for amino acids with hydrocarbon side chains such as alanine and valine (hydrogen molecules are excluded from the ratio) to 2.8 for cystine (Grantham, 1974). (5) Hydropathy, taken from Kyte and Dolittle (1982) and ranging from 4.5 for isoleucine to 4.5 for arginine. (6)

300

G.C. Conant et al. / Molecular Phylogenetics and Evolution 42 (2007) 298–307

Aromaticity: taken from a principle component analysis that appears to differentiate the three aromatic residues (phenylalanine, tyrosine and tryptophan) from the remaining residues (Sneath, 1966). An obvious application of these properties is to employ them as distance measures. Epstein (1967) and Miyata et al. (1979) created distances based on residue volume and polarity, while Grantham (1974) added chemical composition to these two factors. Because the properties are not measured on a uniform scale, a distance measure must also provide relative weights to the properties. Grantham calculated these weights from RRAAS values estimated by McLachlan (1971, 1972). Work by Xia and Li (1998) indicates that, in addition to Grantham’s three factors, the isoelectric point, hydropathy and aromaticity of residues (as well as three other properties derived from principle component analysis by Sneath, 1966) are selectively constrained in protein evolution. In their model of sequence evolution, Goldman and Yang made two amino acid residues’ substitution rate proportional to the Grantham’s distance between them (Grantham, 1974). The major change that our adapted model makes is to allow the weight assigned to each property to vary between alignments. Our approach is similar to that of Koshi and Goldstein, who applied amino acid properties to the study of the HIV virus (Koshi and Goldstein, 1998; Koshi et al., 1999). Following Xia and Li (1998), we also add two properties to those of Goldman and Yang: isoelectric point measurements from Haig and Hurst (1991) and hydropathy measurements from Kyte and Dolittle (1982). For two non-synonymous codons A and B separated by a single nucleotide substitution, the instantaneous codon substitution rate RA,B is given by " # 5 X RA;B ¼ C  exp ai  DiA;B  Rnucleotide ðA; BÞ; ð2Þ i¼1

where i is taken over the five amino acid properties, and ai is the weight given to each property. DiA,B is the absolute value of the difference in the ith property between the amino acids that codons A and B code for. C is a scaling constant which models the overall selective constraint acting on the sequence. Rnucleotide is defined as above. The magnitude of an ai indicates the degree to which amino acid substitution rates differ depending on residues’ differences in that property, while C relates to the overall frequency of non-synonymous substitution. It is possible to conceive of situations where the differences among amino acid properties have little impact on substitution rates (small ais) but where the overall substitution rate is low (small C). We refer to this model as the linear combination of amino acid properties (LCAP) model. Because properties are measured on differing scales, weights for different properties cannot be directly compared. However, it is relevant to consider if an ai is significantly non-zero or differs between two alignments.

2. Methods 2.1. Sequence data Sequences were obtained from GenBank (accession numbers, alignments and phylogenies available as supplementary information) and aligned with ClustalW (Thompson et al., 1994). Phylogenetic trees were inferred with PAUP* 4.0b10 (Swofford, 2002) using the heuristic search option and the Hasegawa/Kishino/Yano model (HKY85 with empirical base frequencies and the transition/transversion ratio estimated from data, Hasegawa et al., 1985). Forcing a phylogenetic topology on sequence data which do not follow that topology can give rise to misleading model inferences (unpublished data). We examined the resulting phylogenetic trees and removed sequences where a gene’s position in the inferred gene trees conflicted with well established relationships from the species trees (for instance clading an amphibian within the amniotes in the TATA-binding protein alignments). The most likely explanation for such differences is the incorrect assignment of orthology, and we thus took the conservative approach of excluding such difficult sequences from our analysis, a choice which should not bias in our results. Gap characters were also removed prior to subsequent analysis. Trees were optimized for model parameters (intra- and inter-group substitution rates or property weights) using a purposewritten computer program. In the TATA-binding protein and rhodopsin genes, tree topologies differed in their placement of the reptiles relative to birds and mammals depending on the model of evolution used. For the analyses below we give the results for the tree topology with the higher likelihood under the MG/GY 94 model (see below), but our inferences were very similar when using the alternative topology (data not shown). All topologies are included as supplemental material. Optimal amino acid groupings (see above) for individual alignments were inferred using simulated annealing (see Supplemental Appendix 1). This procedure estimates both the amino acid groups and the associated substitution rates, Iw and Ib [see Eq. (1)]. When applying these models, we must first test whether they provide a better fit to the data than does a null model. When using nested models in a maximum likelihood context, the more complicated model (H1) cannot give a lower likelihood than the simpler model (H0) because it includes that model as a special case. Instead, one compares the ratio of the likelihoods for the two models to determine if the more complex one offers a significantly better fit. Under certain assumptions (see Section 3), twice the log of this likelihood ratio statistic (2Dln L) follows a v2 distribution, with the number of degrees of freedom (df) being the number of extra parameters in H1 (Sokal and Rohlf, 1995). However, because the number of parameters added by estimating group definitions (and hence the df) cannot be computed directly, we used parametric bootstrapping for significance testing.

G.C. Conant et al. / Molecular Phylogenetics and Evolution 42 (2007) 298–307

To do so, we simulated data under the assumption that all amino acids have equal substitution rates. We refer to this model as the Muse and Gaut/Goldman and Yang model: (MG/GY94): it is a simplified form of that of Goldman and Yang (1994); see Yang and Nielsen (2000). It is also closely related to that of Muse and Gaut (1994). Thus, for two codons i and j separated by a single base substitution, this model has synonymous substitutions occurring at an instantaneous rate qij ¼ jdij  pj and non-synonymous substitutions at a rate qij ¼ x  jdij  pj . Here, j is the transition/transversion rate ratio, dij is an indicator variable taking value 1 if codons i and j differ by a transition and 0 is they differ by a transversion. pj is the empirical frequency of the changed base in codon j and x is the nonsynonymous to synonymous rate ratio. We analyzed the resulting datasets with our simulated annealing code, calculating the difference in log-likelihood between the MG/GY 94 model and the SG model. Individual simulations were quite time consuming: we performed 100 simulations per sequence alignment. From these simulations, we found the distribution of 2Dln L for the comparison of these two models (MLDSG-SIM). We also studied how well a v2 model might fit these simulated distributions, estimating the degrees of freedom using maximum likelihood (MLv2 as opposed to MLDSG-SIM). In Fig. 1 we plot twice the observed differences in log-likelihood from our simulations (MLDSGSIM; bars) and the expected distributions under the v2 distribution with the degrees of freedom that gave the maximum likelihood of seeing the observed differences MLDSG-SIM (thus, MLv2, curves). As the figure indicates,

MLv2 does not completely explain the observed likelihood differences (the boxes do not exactly follow the curves). A v2 goodness-of-fit test confirms this difference is statistically significant for three datasets (the Drosophila opsins, and the human hemoglobin and myosin genes), rejecting the null hypothesis that MLDSG-SIM from the simulations follows MLv2 (P 6 0.05) in these three cases. Nonetheless, Fig. 1 also suggests that MLv2 is generally conservative for these data. For comparative purposes, we also analyzed our datasets using the polarity-based grouping described by Zhang (2000, see Fig. 2). We again used parametric bootstrapping to test whether the estimated groups offered a better fit than did the polarity groups. 2.2. Comparing inferred groups among alignments We next asked whether the inferred amino acid groups were similar for different genes by comparing substitution definitions across alignments. Thus, if two amino acids are placed in the same group in both datasets, their substitution rate will be given by Iw in both cases. Likewise, if two residues are in different groups in both datasets, that rate will be Ib. Using this metric, for each pair of datasets in our analysis we asked how many substitution definitions were shared. Using simulation, we can ascertain how many substitutions would be shared by random groupings. If two pairs of groups (from two different alignments in this case) share more definitions than do our randomly generated pairs of groups, that indicates that the real groups in question have significant similarity to each other.

Paralogs

B Human MHC

A Human actin 1

d.f. =17.8

C Human myosin

D Human

E Drosophila

hemoglobin d.f. =18.4

d.f.=18.7

d.f.=18.5

301

0.75

opsins d.f.=18.5

Cumulative probability

0.5 0.25 0

10

20

30

10

20

30

Orthologs

G Hexokinase

F Rhodopsin 1

20

30

H TATA-binding protein d.f.=18.0

d.f=18.6

d.f.=18.5

10

10

20

30

10

I Beta actin

20

30

J Preproalbumin d.f.=17.3

d.f.=18.4

0.75 0.5 0.25 0

10

20

30

10

20

30

10

20

30

10

20

30

10 20 30 40

2Δ Fig. 1. Distribution of likelihood ratio statistics from simulations. 100 datasets were simulated under an equal-rates model (MG/GY 94). The best amino acid grouping for each simulation was then estimated. The panels show histograms of the cumulative distribution of twice the likelihood difference between these models (MLDSG-SIM above). Also shown (lines) are cumulative v2 distributions fit to these data via maximum likelihood (MLDv2, see Section 2). Df gives the maximum likelihood estimate of the degrees of freedom for each of these curves (see Section 2). Panels A–J give the observed distribution in log-likelihood ratios for each of the ten sequence alignments studied.

302

G.C. Conant et al. / Molecular Phylogenetics and Evolution 42 (2007) 298–307

2.3. Temporal patterns of variation in amino acid substitution rates We studied whether substitution patterns varied within a single orthologous gene or family of gene paralogs. One approach to this problem is to allow each branch in the tree to take on its own values for the parameters describing the amino acid substitutions. However, this yields many parameters, often with very wide confidence intervals. Instead, we sequentially allowed each branch in our trees to take on its own values for amino acid substitution parameters (Ib and Iw or property weights, respectively), while fixing those values across the remainder of the tree. For each alignment, we thus calculated the log-likelihood difference between the single rate tree and each of the 2n  5 trees with two rates. We have plotted the resulting average log-likelihood ratios in Fig. 5. 3. Results We selected 10 sequence alignments for our analysis, five representing multiple paralogs from the same organism and five of orthologous genes from multiple organisms. This split represents an important distinction. For the most part, orthologous sequences represent a single function maintained through the history of the sequences, while paralogous genes may change function through time. Our five duplicated gene alignments are the actin genes, myosin genes, hemoglobin genes, and MHC class I genes from humans and the opsin genes from Drosophila, while the orthologous relationships we studied are those for preproalbumin (the molecule before cleavage into its active form), the visual pigment rhodopsin, beta actin, the enzyme hexokinase and the TATA-binding protein, a transcription factor. We do not claim these genes represent an unbiased sample of any genome, but they do represent a reasonable diversity of protein types including membrane-spanning (rhodopsin, opsin), globular (hexokinase, TATA-binding, albumin), and structural (actin, myosin) proteins. For all 10 of our alignments, the two models (SG and LCAP) which incorporate differences in amino acid substitution rates fit our data significantly better than does the single rate Muse and Gaut/Goldman and Yang model (MG/GY94, Goldman and Yang, 1994; Muse and Gaut, 1994). We discuss these results individually for each model. 3.1. SG model To determine if our estimated groups provide a better fit to the observed sequences than does a single rate model (MG/GY 94), we used parametric bootstrapping (see Section 2). In no case was a simulated likelihood ratio statistic as large as that seen in our actual data (P < 0.01). We also calculated significance values by fitting the simulated likelihood ratio statistics to a v2 distribution. As discussed, for two models H0 and H1 where H0 is a special case of H1 and where the data actually follow H0, 2Dln L is asymptot-

ically distributed v2 when the sampling distribution of the extra parameters in H1 (here the SG model) is normal (Ota et al., 2000). Cases where fixed parameters in H0 (here the MG/GY 94 model) take on values at the boundaries of their respective ranges in H1 thus will not generally have 2Dln L follow the v2 distribution (Self and Liang, 1987). In our case, the sampling distribution of interest is that of the estimated amino acid groups. It is hardly clear even what type of distribution to expect in this case. Moreover it can be argued that the models suffer from a boundary issue: the SG model degenerates to the MG/GY 94 model when one group is empty. It is thus not particularly surprising that our simulated distributions of 2Dln L shown in Fig. 1 (MLDSG-SIM above) differ from a v2 distribution (MLv2 above). We note that similar boundary issues have previously been found to make the v2 approximation conservative (Ota et al., 2000; Anisimova et al., 2001), similar to the results seen here. In Fig. 2A, we show the selective constraint Ib/Iw (ratio of radical to conservative substitutions; see above) inferred when the polarity groups (taken from Zhang, 2000; y-axis) are used to define radical and conservative substitutions and this same ratio when we estimate groups from data (x-axis). Two features are of note. First, the inferred groups always indicate stronger constraint than do the polarity groups (all points are to the left of the y–x line). Moreover, there is no strong relation between the constraint inferred with the polarity groups and that with the inferred groups. Thus, the usefulness of the polarity groups for measuring selective constraint varies depending on the gene being studied. Note that the two actin datasets (the human actin paralogs and the beta actin orthologs) show Ib > Iw: while Hughes et al., generally had Rc > Rr (1990). The obvious interpretation of this observation is one of directional selection: i.e. radical amino acid substitutions (rate Ib) occurring more commonly than conservative ones (rate Iw). However, closer analysis of the genes and inferred groups in question shows that in fact this result is due to very low overall substitution rates in these genes, meaning that only a very small fraction of the 190 possible amino acid substitutions are actually observed. In such cases, Iw and Ib can have meanings quite different than their usual ones (Supplemental Appendix 2). For clarity, we have inverted the ratio shown in Fig. 2A to Iw/Ib for these two genes and marked them each with an ‘‘*.’’ Fig. 2B compares the improvement in likelihood (twice the difference in log-likelihood: 2Dln L) over the MG/GY 94 model obtained by estimating group definitions from data (MLDSG-E, x-axis) to that obtained from fixing those groups based on polarity (MLDSG-P, y-axis). The vertical dashed line in Fig. 2B indicates a significance level of 0.05 under the assumption that MLDSG-E follows a v2 distribution. Here, the degrees of freedom used was the largest observed among our simulations (18.73, from the human myosin paralogs, Fig. 1C). Using this value is conservative for the other sequences: we find all datasets show P < 104.

G.C. Conant et al. / Molecular Phylogenetics and Evolution 42 (2007) 298–307

A

B

Comparison of inferred constraints on AA substitutions

Likelihood improvements 50

2ΔlnL--MG/GY94 vs SG:Polarity groups (ML ΔSG-P)

1.2

1

Ib/Iw: Polarity groups

303

0.8

*

Paralogs Human actin Human MHC Human myosin Human hemoglobin Drosophila opsins

0.6

* 0.4

Orthologs Rhodopsin Hexokinase TATA-binding protein Beta Actin Albumin

0.2

0

40

30

20

10

0 0

0. 2

0. 4

0. 6

0. 8

1

1. 2

0

Ib/Iw: Inferred groups

100

200

300

400

500

600

700

800

2ΔlnL--MG/GY94 vs. SG:Inferred groups (MLΔSG-E)

Fig. 2. Differences between inferred and polarity groups for the SG model. (A) Comparison of the constraint on amino acid substitutions inferred when the polarity groups are used and when groups are estimated from data. Constraint is measured as the ratio of radical (Ib) to conservative (Iw) substitutions inferred under the two groupings. The y = x line shows the expectation if both methods retrieved the same constraint. For comparative purposes, we have inverted the Ib/Iw ratio for the estimated groups in the human actin paralogs and the beta actin genes, because these two sequence alignments showed Ib > Iw (marked with ‘‘*’’ in this panel; see text for further details). (B) Observed likelihood differences between a single amino acid substitution rate model (MG/GY 94, see text) and two models which divide the amino acid residues into groups. On the x-axis are the observed likelihood differences when the groups in question are estimated from the dataset (MLDSG-E, see text). On the y-axis are the differences observed when the amino acids are divided based on their polarity (MLDSG-P). The horizontal and vertical dashed lines give significance levels for the polarity and estimated groups, respectively (see text).

Estimating groups from data offers a significantly better fit than does using polarity groups (P < 0.01, see Section 2). Nonetheless, the two transmembrane proteins (the Drosophila opsins and the rhodopsin orthologs) both show large MLDSG-P (as does hexokinase, Fig. 2B), which is reasonable, given that these first two proteins possess membrane spanning domains, where substitutions that change residue polarity may indeed be disruptive to function. Note that the human myosin and hemoglobin paralogs show no better fit using the polarity groups than under the MG/GY 94 model (Fig. 2B, horizontal dashed line, P = 0.05, v2 distribution with df = 1). Finally, we asked whether the estimated groups for any two datasets were more similar than one would expect by chance. We find seven pairwise comparisons where this is the case (P 6 0.015). Fig. 3 illustrates these relationships as a graph: nodes are datasets, and edges join nodes (datasets) with significant similarity. Not unexpectedly, the two actin datasets (which had shown Ib > Iw) have similar groups (P = 0.015). Albumin and hexokinase also share similar group definitions (P = 0.015). The remaining five significant comparisons join four datasets (the myosin, hemoglobin and opsin paralogs and the rhodopsin orthologs). All four genes show strong similarities in groups (P 6 0.004) with the exception of the myosin paralogs and the rhodopsin orthologs (P = 0.11). We also compared the estimated groups to the polarity grouping. No dataset’s estimated groups were significantly similar to the polarity groups.

Hhem

Hmyo Hact

P=0.11

Rho TBP

Hmhc

Act Dops

Hex

Paralogs

Alb Orthologs

Fig. 3. Similarity of inferred groups for our 10 datasets. Each dataset is represented as a node in a graph: edges join nodes with significant similarity (P 6 0.015) in their groups. All four members of the largest component (Hmyo, Hhem, Dop, and Rho) have similarity in groupings, but the dashed line indicates that the similarity between Hmyo and Rho is non-significant. Shaded nodes indicate paralogous genes. Key: Hact, human actins; Hmhc, human MHC class I; Hhem, human hemoglobins; Hmyo, human myosins; Dop, Drosophila opsins; Rho, rhodopsin orthologs; Hex, kexokinase orthologs; TBP, TATA-binding proteins; Act: b-Actin orthologs; and Alb, preproalbumin orthologs.

304

G.C. Conant et al. / Molecular Phylogenetics and Evolution 42 (2007) 298–307

3.2. LCAP model

A Chemical Composition

B Polarity

1 Paralogs Orthologs

0.4

0.5

0.2

0

0 -0.2

-0.5

* -0.4

-1

C

Volume

D Iso-electric point

0.4 0.3

0.6

0.2

0.4

0.1

0.2

0

0

-0.1

-0.2

-0.2

-0.4

-0.3

-0.6

-0.4

E

Hydropathy

F Scaling Coefficient

0.6

0 -0.2

*

-0.4 Rhodopsin Hexokinase TATA-binding Protein Beta actin Albumin

-0.6

Rhodopsin Hexokinase TATA-binding Protein Beta actin Albumin

0.2

Human actin Human MHC Human myosin Human hemoglobin Drosophila opsins

0 -0.5 -1 -1.5 -2 -2.5 -3 -3.5 -4

0.4

Human actin Human MHC Human myosin Human hemoglobin Drosophila opsins

In all cases, the LCAP model provides a significantly better fit than does the MG/GY94 model (P < 107: note that there are five more parameters in this model than the MG/GY94 model). We determined which of our five amino acid properties (chemical composition, polarity, volume, iso-electric point, and hydropathy) were significantly non-zero (implying that there is a dependence of substitution rate on that property). To do so, we sequentially removed each property (set ai = 0) and evaluated whether the full model gave a better fit than the model without that property. We applied a Bonferonni correction to these tests, performing each at a = 0.01 (family error rate of 0.05; Sokal and Rohlf, 1995). Broadly speaking, there is variation among the datasets both in the ais as well as in which properties are significant (Fig. 4 and Table 1; values of the ais are provided in Supplementary Table 1). One potential difficulty with our model is illustrated in Fig. 4. For three cases with the chemical composition property and one with the polarity property, we have ai > 0. When ai < 0, that implies that substitutions are constrained by that property. Thus, a positive value implies that substitutions with large differences in the property are actually favored. This might at first seem to indicate some form of directional selection, but we suspect it to be artifactual, probably suggesting that the other properties in our model imperfectly capture the true nature of amino acid evolution. For instance, this observation could reflect strong purifying selection on an amino acid property not included in our model, were there a negative correlation between that property and the positively weighted one. It could also occur if the relationship between substitution rate and residues’ differences in a given property were non-linear. Since our model assumes a linear relationship (which in general provides a better fit than other relationships tried, data not shown), positive weights in one property might correct for another property that is ‘‘over-weighted’’ for certain amino acid substitutions, were the two properties in question positively correlated with each other. The positive weight for polarity seen with the TATA-binding protein may be an example of this effect, as that property is correlated with hydropathy (note however that these properties are not interchangeable as one or the other is significantly more constrained in several cases). It is also notable that the chemical composition property is somewhat free: it takes on a value of zero for several residues, meaning that chemical composition cannot be used to distinguish them. We suspect that this property is also detecting some aspect of substitution that does not completely follow our linear model. In the future, it will be desirable to test these and other possible explanations for the positive weights. We note however that the LCAP model still fits these data significantly better than does the MG/GY94 model even with the positively weighted properties removed (data not shown).

Fig. 4. Amino acid property weights (ais, panels A–E) and the value of the scaling parameter (C, panel F) under the LCAP model. Cases where ai is not significantly different from 0 are shown with a value of 0 (P > 0.01). Note that magnitudes of property coefficients cannot be compared between properties. The polarity and hydropathy properties for human hemoglobin are marked with ‘‘*’’ because these two properties are individually non-significant, but the model fit is significantly worse if both properties are removed. The likelihood listed in Table 1 is that with apolarity = 0, as this gave a higher likelihood than when ahydropathy = 0.

3.3. Patterns of evolution in duplicated and orthologous genes We sequentially allowed each branch in the trees for our 10 datasets to take on its own amino acid substitution rates, while holding the remainder of the tree to a single set of rates. By comparing the likelihoods obtained with this procedure to the single-rate likelihood, we could assess whether amino acid substitution patterns vary over the history of these genes. Fig. 5 shows the results of this analysis. One might expect that functional divergence of paralogous genes might result in a higher degree of amino acid substi-

G.C. Conant et al. / Molecular Phylogenetics and Evolution 42 (2007) 298–307

305

Table 1 Sequence alignments considered and the fit of the LCAP model Gene

# Sequences

Human actin Human MHC Human myosin Human hemoglobin Drosophila opsins Rhodopsin Hexokinase TATA binding prot b-Actin Preproalbumin

# Non-gapped sites

6 6 8 8 6 31 13 16 9 15

MG/GY94 ln L 3455.1 3195.1 26290.2 2377.9 6780.8 10970.6 14766.7 4556.0 5279.5 18031.2

1122 1005 5796 420 1083 1035 1296 528 1122 1770

LCAP ln La

Pb

df 9

2 · 10 7 · 109 1 · 10279 6 · 1017 3 · 1039 2 · 1065 8 · 1077 2 · 1025 5 · 1022 2 · 10104

3434.9 3176.3 25641.6 2338.7c 6689.8 10819.1 14588.7 4492.9 5226.5 17787.1

2 2 4 3 3 3 3 5 4 4

a

Likelihood under the LCAP model with non-significant ais set to 0 (see text and Fig. 4). P-value for the test of no improvement in fit with the LCAP model versus the MG/GY 94 model. Degrees of freedom taken from the number of significant properties found (adjacent column). c Please see Fig. 4 and caption. b

SG Model

LCAP model

B 1

20

1

Average 2ΔlnL

14 0.8

12 10

0.8

15

0.6

8

0.6 10

0.4

6 4

0.2

0.4 5

0.2 Rhodopsin Hexokinase TATA-binding Protein Orthologs Beta actin Albumin

0

Paralogs

0

Human actin Human MHC Human myosin Human hemoglobin Drosophila opsins

Rhodopsin Hexokinase TATA-binding Protein Orthologs Beta actin Albumin

Paralogs

0

Human actin Human MHC Human myosin Human hemoglobin Drosophila opsins

2 0

Proportion of significant comparisions

A 16

Fig. 5. Patterns of rate variation among the orthologs and paralogs studied when the pattern of amino acid substitution is serially allowed to vary along individual branches of the phylogenetic tree in question (see main text). Shown are the average change in likelihood (across all branches in the tree, left axis, grey bars) and the proportion of the total comparisons significant at P = 0.05 (right axis, black bars). The dashed lines give the value of 2Dln L that would be significant (P = 0.05) under a v2 distribution with the appropriate number of degrees of freedom. (A) Results from the SG model. (B) Results from the LCAP model.

tution pattern heterogeneity than that in the orthologous genes, which, in most cases, maintain a single function across the tree. However, although the studied genes show clear variation in substitution pattern heterogeneity, there are no systematic differences between paralogs and orthologs. Although the human actin and myosin paralogs show the highest heterogeneity, the beta actin orthologs also show high average likelihood changes, while two sets of paralogs, (human hemoglobin and MHC class I), show very little heterogeneity. It may be that the differences seen in Fig. 5 are more a result of functional differences in the proteins. For instance, note that the rhodopsin orthologs and opsin paralogs, which share similar functions, both have relatively high average likelihood changes.

4. Discussion It likely comes as little surprise that there are meaningful differences in the patterns of amino acid substitution between genes, particularly as the genes studied here were chosen in part for their very different functions. This point raises the obvious question of the relationship between substitution patterns and protein function. The LCAP model in particular has the potential to probe this relationship. Studying Fig. 4, a few points can be made. The three datasets with highest aiso-electric point values are the human myosin paralogs, the human actin paralogs and the beta actin orthologs. It seems appropriate to speculate that the surface interactions required for these proteins’ functions

306

G.C. Conant et al. / Molecular Phylogenetics and Evolution 42 (2007) 298–307

could result in a selective constraint on changes in residue charge. Note also that the four alignments with the most constrained volumes are, in descending order, the TATAbinding protein, the beta actin paralogs, hexokinase, and the human myosin paralogs. These are all genes whose globular structure may constrain the volumes of interior residues. On the other hand, the remaining five alignments (avolume was not significantly different from zero for the human actin paralogs) are, sorted in descending order, human hemoglobin, rhodopsin and albumin (tie), the human MHC class I paralogs and the Drosophila opsin paralogs. The presence of the two membrane-spanning proteins (opsin and rhodopsin) in this list makes us suspect that membrane-spanning domains are less constrained in residue volume. One issue not explicitly addressed here is that of site-tosite variation in rates of sequence evolution, specifically in rates of amino acid substitution. One method of modeling such variation is to consider it to be solely a result of variation in selective constraint on substitutions at different sequence positions (Nielsen and Yang, 1998; Yang et al., 2000a). Such variation in selective pressure clearly exists (Yang et al., 2000b), but another potential cause of siteto-site variation in observed substitution rates is exactly the differences in relative amino acid substitution rates considered here. In other words, because amino acid composition differs across sites, some of the variation in how often we observe a substitution at a given site will be due to compositional differences (Halpern and Bruno, 1998). In the future, it may be useful to model these two disparate effects separately. For instance, the Iw and Ib values in the SG model could be allowed to vary across sites according to a discrete gamma distribution as described by Yang (1993, 1994). We did not find evidence for more varying selective constraints across gene trees in paralogous genes as compared to orthologous genes. Given that current models suggest that duplicate preservation should be associated with functional changes in the duplicates (Li, 1980; Force et al., 1999), this result is somewhat surprising. Of course, given our small sample of duplicated and orthologous genes and the likely low power of the test employed, it is difficult to know how general the pattern observed actually is. If it is indeed widespread, that might indicate that other modes of functional divergence (such as changes in gene expression) are the dominant mode of diversification among duplicate genes. The hemoglobin genes are an example of such diversification, where many paralogs are only expressed at specific embryonic stages. However, these genes have also undergone functional changes with respect to their oxygen affinities, changes not detected in our analysis. The relative importance of different types of functional divergence in duplicate gene evolution will be an interesting topic for further study. These models of amino acid substitution have several potential applications. Estimating groups from data for

the SG model and then applying those groups to calculating Rr and Rc may improve this method’s ability to detect selection, since using groups that do not adequately match the true substitution patterns should reduce the power of such tests. The models may also help detect subfunctionalization in gene paralogs (Force et al., 1999). Subfunctionalization may result in functional divergence without directional selection. However, the relaxation of selective constraints involved could be detectable with clever applications of these models. More complete models of substitution may also help answer other questions regarding protein evolution. Two topics of particular interest are how a residue’s position in the protein’s three-dimensional structure affects its propensity for substitution and how protein–protein interactions effect substitution patterns. On this second point, research by two of us has suggested that proteins with highly conserved interactions may evolve in a manner distinct enough to yield misleading phylogenetic signals (Campos et al., 2004). Improved models of sequence evolution offer the potential for better understanding of many areas of biology. For instance, it is possible that more complex models of protein-coding gene evolution might improve phylogenetic inference, perhaps in datasets containing both recent and ancient speciation events. Of course, codon models of evolution are computationally expensive, and it is not yet clear if this cost is offset by improvements in inference quality. In this vein, we suggest that one temptation which should be generally avoided in molecular evolution studies is to simply apply the most complex model available to any given problem. We prefer instead to think of a (hopefully expanding) toolkit, with different models available to suit different kinds of questions. Acknowledgments G.C.C. thanks A. Wagner for critical comments on early versions of this manuscript. This work was supported by the Bioinformatics Initiative of the Deutsche Forschungsgemeinschaft (DFG), Grant # BIZ-6/1-2, and by Science Foundation Ireland. Appendix A. Supplementary data Supplementary data associated with this article can be found, in the online version, at doi:10.1016/j.ympev.2006.07.006. Supplementary Materials: GenBank accession numbers, nucleotide sequence alignments and optimized trees (MG/GY 94, LCAP, and SE models) for all 10 datasets are available. The supplementary Table gives coefficient values for our 10 datasets under the LCAP model. Supplementary Appendix 1 gives a description of our simulated annealing method and Supplementary Appendix 2 discusses reasons why the actin datasets show Ib > Iw.

G.C. Conant et al. / Molecular Phylogenetics and Evolution 42 (2007) 298–307

References Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J., 1990. Basic local alignment search tool. J. Mol. Biol. 215, 403–410. Altschul, S.F., Madden, T.L., Schaffer, A.A., et al., 1997. Gapped blast and psi-blast: a new-generation of protein database search programs. Nucleic Acids Res. 25, 3389–3402. Anisimova, M., Bielawski, J.P., Yang, Z., 2001. Accuracy and power of the likelihood ratio test in detecting adaptive molecular evolution. Mol. Biol. Evol. 18, 1585–1592. Campos, P.R.A., Olivera, V.M., Wagner, G.P., Stadler, P.F., 2004. Gene phylogenies and protein–protein interactions: possible artifacts resulting from shared protein interaction partners. J. Theor. Biol. 231, 197– 202. Dayhoff, M.O., Eck, R.V., Park, C.M., 1972. A model of evolutionary change in proteins. In: Dayhoff, M.O. (Ed.), Atlas of Protein Sequence and Structure. National Biomedical Research Foundation, Washington, DC, pp. 89–99. Dayhoff, M.O., Schwartz, R.M., Orcutt, B.C., 1978. A model of evolutionary change in proteins. In: Dayhoff, M.O. (Ed.), Atlas of Protein Sequence and Structure. National Biomedical Research Foundation, Washington, DC, pp. 345–352. Epstein, C.J., 1967. Non-randomness of amino acid substitutions during the evolution of proteins. Nature 215, 355–359. Force, A., Lynch, M., Pickett, F.B., et al., 1999. Preservation of duplicate genes by complementary, degenerative mutations. Genetics 151, 1531– 1545. Goldman, N., Yang, Z., 1994. A codon-based model of nucleotide substitution for protein-coding DNA sequences. Mol. Biol. Evol. 11, 725–736. Grantham, R., 1974. Amino acid difference formula to help explain protein evolution. Science, 185. Graur, D., 1985. Amino acid composition and the evolutionary rates of protein-coding genes. J. Mol. Evol. 22, 53–62. Haig, D., Hurst, L.D., 1991. A quantitative measure of error minimization in the genetic code. J. Mol. Evol. 33, 412–417. Halpern, A.L., Bruno, W.J., 1998. Evolutionary distances for proteincoding sequences: modeling site-specific residue frequencies. Mol. Biol. Evol. 15, 910–917. Hasegawa, M., Kishino, H., Yano, T., 1985. Dating of the human–ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 22, 160–174. Hedstrom, L., Perona, J.J., Rutter, W.J., 1994. Converting trypsin to chymotrypsin: residue 172 is a substrate specificity determinant. Biochemistry 33, 8757–8763. Henikoff, S., Henikoff, J.G., 1992. Amino-acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. USA 89, 10915– 10919. Hughes, A.L., Ota, T., Nei, M., 1990. Positive Darwinian selection promotes charge profile diversity in the antigen-binding cleft of class I major histocompatibility complex genes in mammals. Mol. Biol. Evol. 7, 515–524. Kirkpatrick, S., Gelatt, C.D.J., Vecchi, M.P., 1983. Optimization by simulated annealing. Science 220, 671–680. Kosakovsky Pond, S.L., Frost, S.D.W., Muse, S.V., 2005. HyPhy: hypothesis testing using phylogenies. Bioinformatics 21, 676–679. Koshi, J.M., Goldstein, R.A., 1998. Models of natural mutations including site heterogeneity. Proteins 32, 289–295. Koshi, J.M., Mindell, D.P., Goldstein, R.A., 1999. Using physicalchemistry-based substitution models in phylogenetic analyses of HIV-1 subtypes. Mol. Biol. Evol. 16, 173–179.

307

Kyte, J., Dolittle, R., 1982. A simple method for displaying the hydropathic character of a protein. J. Mol. Biol. 157, 105–132. Li, W.-H., 1980. Rate of gene silencing at duplicate loci: a theoretical study and interpretation of data from tetraploid fish. Genetics 95, 237– 258. McLachlan, A.D., 1971. Tests for comparing related amino-acid sequences. Cytochrome c and cytochrome c551. J. Mol. Biol. 61, 409–424. McLachlan, A.D., 1972. Repeating sequences and gene duplication in proteins. J. Mol. Biol. 64, 417–437. Miyata, T., Miyazawa, S., Yasunaga, T., 1979. Two types of amino acid substitution in protein evolution. J. Mol. Evol. 12, 219–236. Muse, S.V., Gaut, B.S., 1994. A likelihood approach for comparing synonymous and nonsynonymous nucleotide substitution rates, with application to the chloroplast genome. Mol. Biol. Evol. 11, 715–724. Nielsen, R., Yang, Z., 1998. Likelihood models for detecting positively selected amino acid sites and applications to the HIV-1 envelope gene. Genetics 148, 929–936. Ota, R., Waddell, P.J., Hasegawa, M., Shimodaira, H., Kishino, H., 2000. Appropriate likelihood ratio tests and marginal distributions for evolutionary tree models with constraints on parameters. Mol. Biol. Evol. 17, 798–803. Salter, L.A., Pearl, D.K., 2001. Stochastic search strategy for estimation of maximum likelihood phylogenetic trees. Syst. Biol. 50, 7–17. Self, S.G., Liang, K.-Y., 1987. Asymptotic properties of maximum likelihood estimators and likelihood ratio tests under nonstandard conditions. J. Am. Stat. Assoc. 82, 605–610. Sneath, P.H.A., 1966. Relations between chemical structure and biological activity. J. Theor. Biol. 12, 157–195. Sokal, R.R., Rohlf, F.J., 1995. Biometry, Third ed. W.H. Freeman and Company, New York. Swofford, D.L., 2002. PAUP*. Sinauer, Sunderland, MA. Thompson, J.D., Higgins, D.G., Gibson, T.J., 1994. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, positions-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22, 4673–4680. Thorne, J.L., Goldman, N., Jones, D.T., 1996. Combining protein evolution and secondary structure. Mol. Biol. Evol. 13, 666–673. Tourasse, N.J., Li, W.-H., 2000. Selective constraints, amino acid composition, and the rate of protein evolution. Mol. Biol. Evol. 17, 656–664. Xia, X., Li, W.H., 1998. What amino acid properties affect protein evolution? J. Mol. Evol. 47, 557–564. Yang, Z., 1993. Maximum-likelihood estimation of phylogeny from DNA sequences when substitution rates differ over sites. Mol. Biol. Evol. 10, 1396–1401. Yang, Z., 1994. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods. J. Mol. Evol. 39, 306–314. Yang, Z., Nielsen, R., 2000. Estimating synonymous and nonsynonymous substitution rates under realistic evolutionary models. Mol. Biol. Evol. 17, 32–43. Yang, Z., Nielsen, R., Goldman, N., Pedersen, A.-M.K., 2000a. Codonsubstitution models for heterogeneous selection pressure at amino acid sites. Genetics 155, 431–449. Yang, Z., Swanson, W.J., Vacquier, V.D., 2000b. Maximum likelihood analysis of molecular adaptation in abalone sperm lysin reveals variable selective pressures among lineages and sites. Mol. Biol. Evol. 17, 1446–1455. Zhang, J., 2000. Rates of conservative and radical nonsynonymous nucleotide substitutions in mammalian nuclear genes. J. Mol. Evol. 50, 56–68.