Accepted Manuscript
Mutational Analysis Employing a Phylogenetic Mass Tree Approach in a Study of the Evolution of the Influenza Virus
Elma H. Akand, Kevin M. Downard
PII: S1055-7903(17)30120-3
DOI: http://dx.doi.org/10.1016/j.ympev.2017.04.005
Reference: YMPEV 5791
To appear in: Molecular Phylogenetics and Evolution
Received Date: 5 February 2017
Revised Date: 29 March 2017
Accepted Date: 5 April 2017
Please cite this article as: Akand, E.H., Downard, K.M., Mutational Analysis Employing a Phylogenetic Mass Tree Approach in a Study of the Evolution of the Influenza Virus, Molecular Phylogenetics and Evolution (2017), doi: http://dx.doi.org/10.1016/j.ympev.2017.04.005
Mutational Analysis Employing a Phylogenetic Mass Tree Approach in a Study of the Evolution of the Influenza Virus
Elma H. Akand and Kevin M. Downard*
Infectious Disease Responses Laboratory, University of New South Wales, Sydney, NSW, Australia
Running title: mutational analysis employing mass trees
Word Count: 5406
Keywords: phylogenetics, algorithm, tree, mutation, mass spectrometry
*Address for correspondence: Prof. Kevin Downard, Infectious Disease Responses Laboratory, Prince of Wales Clinical School, University of New South Wales, Sydney, NSW 2052, AUSTRALIA. Tel. +61 (0)2 9385 2886. Email: [email protected]
Abstract
A mass-based approach has been advanced that enables mutations associated with the evolution of proteins to be both charted and interrogated using phylogenetic trees built solely from the masses of peptides generated upon protein proteolysis. The modified MassTree algorithm identifies and displays all such mutations and calculates the frequency of each mutation across a tree. The significance of each mutation is scored according to its position(s) on the tree, where mutations that occur toward the base (root) of the tree are weighted more favourably. A comparison with data generated from a conventional sequence-based tree demonstrates the reliability of mutational analysis employing this approach. Although illustrated in this work for the evolution of influenza hemagglutinin, the approach has far broader applicability and can be applied to investigate the evolution of any organism. In the case of simple microorganisms this can be achieved even without the separation of component proteins. Given the central role that mass map or fingerprint data play in protein identification in proteomics, this work demonstrates that such data can be successfully employed in a phylogenetics strategy to better understand, and help predict, evolutionary trends from the perspective of the functional proteins expressed by the organism.
Introduction
Phylogenetics is indispensable for classifying organisms, identifying genes, and even for reconstructing ancestral genomes (Yang and Rannala 2012). The objective of most studies is to construct a tree-like pattern that describes the evolutionary relationships between the organisms under investigation, where the branches of the tree display the divergence of a species from a common ancestor (Gregory 2008; O’Meara 2012). In recent years, there has been an increased focus on evolution from the perspective of functional proteins (Todd et al. 1999; Yang et al. 2009), rather than the genes that encode them, where homologous proteins of similar sequence are considered to derive from a common ancestor (Gabaldón 2005). Since protein sequences are written in a 20-letter alphabet, they provide more information per residue position than either DNA or RNA. As has been well described (Matsuda et al. 1994; Gupta 1998; Balaji and Srinivasan 2007; Opperdoes 2009), there are additional advantages to using protein over gene sequences to study evolution, primarily due to the degeneracy of the genetic code. All but two amino acids (methionine and tryptophan) are encoded by at least two codons, so that many changes at the third base position have no effect on the encoded protein sequence. The availability of algorithms that can align multiple protein sequences facilitates tree construction (Boyce et al. 2014). Given that mass spectrometry is now central to the analysis (Zhang et al. 2014) and sequencing of proteins (Zhang et al. 2014; Cottrell 2011), in addition to studies of their structure and interactions (Downard 2006; Downard 2007), it follows that mass spectral data can also be exploited for phylogenetic analyses. We conceived and implemented a new phylogenetic approach to chart evolutionary history using not gene or protein sequences, but instead the masses of peptides derived from the digestion of proteins (Lun et al. 2013).
The masses of peptides produced from the proteolysis of a protein reflect the
sequence of that protein. Homologous proteins yield peptides with identical mass in sections of the protein that share a common sequence. When high-resolution mass spectrometry is employed, these masses are accurate to within 1-2 parts-per-million (ppm) such that even subtle changes to a peptide’s sequence will alter its mass (Wada et al. 1989). Lists of peptide masses (i.e. numbers), rather than sequence data, can be employed to construct phylogenetic trees and these “mass trees” used to trace evolutionary history following the digestion of component proteins post-recovery or within whole digests for simple microorganisms (Lun et al. 2013; Swaminathan and Downard 2014). Importantly, the mass trees were found to be highly congruent with those generated using gene sequence data (Lun et al. 2013; Swaminathan and Downard 2014). The generation of peptide mass maps (or fingerprints) for input data can be achieved rapidly and from low sample levels by mass spectrometry. These are widely employed for protein identification in proteomics applications (Zhang et al. 2014; Downard 2006). Mass maps can be recorded within a fraction of a second, in less time and with less material (typically at least an order of magnitude less) than that required for tandem mass spectrometric (MS/MS) sequencing. The sample levels used for the application described have also been demonstrated to be comparable to those required for high quality PCR sequencing (Fernandes and Downard 2014). Furthermore, no additional computational approaches are required to interpret data, predict or assemble protein sequences; processes that are not infallible. In the case of the influenza virus, we have demonstrated that mass data recorded for an uncharacterized strain can be fit onto such trees (Swaminathan and Downard 2014). 
This allows its relationship to previously characterized strains to be assessed and the strain’s susceptibility or resistance to antiviral inhibitor drugs to be determined (Swaminathan and Downard 2014). Antiviral resistance has been widely observed in strains associated with regional influenza epidemics. Monitoring the evolution of influenza is a key requirement of
the global surveillance of the virus in order to mitigate its impact on human and animal health through better prediction of future strains, which aids the formulation of efficacious vaccines (Tosh et al. 2010). Understanding mutation trends is important to this goal (Wui and Yan 2006), and evolutionary inferences drawn from the phylogenetic analysis of molecular sequences are of general interest in evolutionary biology (Brocchieri 2001). A feature of the original mass tree algorithm (Lun et al. 2013) was the inclusion of a factor within the distance matrix function that gave weight to pairs of peptide masses, found across two mass sets, whose mass difference was likely associated with a simple mutation. Such pairs indicate that the proteins from which the peptides were derived are similar or homologous, as opposed to having sequences that are widely divergent. In this work, we describe advances in the mass tree approach and algorithm that enable amino acid mutations identified during tree construction to be displayed on mass trees, and their frequency and significance in terms of protein evolution to be monitored. By definition, and unlike gene-based approaches, all the identified mutations are nonsynonymous; that is, they alter the protein sequence and are therefore of potential biological significance. To establish the reliability of identifying mutations using mass trees, the results are compared with those obtained from conventional gene sequence trees. Although applicable to the study of the evolution of any expressed proteins, and the organisms from which they were derived, the mutational analyses herein investigate the evolution of the influenza virus. Its evolution is particularly influenced by the hemagglutinin protein, the most abundant protein on the surface of the virus particle. The hemagglutinin is involved in the initial stages of infection of the host cell and its sequence is impacted by both antigenic shift and antigenic drift (Suzuki and Nei 2002).
It is also used to define type A viruses according to specific subtypes (denoted HxNy) (Burke and Smith 2014) and distinguish type B viral lineages (Arvia et al. 2014).
Materials and Methods
Hemagglutinin (HA) sequence data
A comparable number of strains for each virus subtype/type were selected from the GISAID (Global Initiative on Sharing All Influenza Data) EpiFlu database across the periods 1973-2015 (for type A strains) and 1972-2016 (for type B strains, with those of the Yamagata lineage taken from 1985, slightly ahead of the emergence of that lineage in 1988), using approximately two to three full-length non-redundant sequences per year. Sequences were checked and corrected for any missing bases using data from the NCBI influenza virus sequence and influenza research FluDB databases. Two sets of 83 type A H3N2, 62 type B (Victoria lineage) and 70 type B (Yamagata lineage) strains were used, with the annual World Health Organization recommended vaccine strains included in one set for each subtype/type to reflect those that were most common in circulation from one year to the next.
Peptide mass datasets
Theoretical monoisotopic masses for the tryptic peptides derived from all translated gene sequences were generated using the PeptideMass algorithm (Wilkins et al. 1997) (http://web.expasy.org/peptide_mass) employing zero missed cleavage sites, no posttranslational modifications, and reflecting 100% sequence coverage.
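The generation of a theoretical peptide mass set can be illustrated with a minimal Python sketch. It is not the PeptideMass tool used in this work; it assumes a simplified trypsin rule (cleavage after K or R unless the next residue is P), standard monoisotopic residue masses rounded to five decimal places, and hypothetical function names (`tryptic_peptides`, `mh_mass`, `mass_set`). The 200 mass unit cut-off mirrors the filter applied by the MassTree algorithm.

```python
# Standard monoisotopic residue masses (rounded to 5 d.p.); illustrative values.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # H2O added to the sum of residue masses
PROTON = 1.00728   # charge carrier for the [M+H]+ ion


def tryptic_peptides(sequence):
    """Cleave after K or R unless followed by P (assumed rule), no missed cleavages."""
    peptides, start = [], 0
    for i, aa in enumerate(sequence):
        nxt = sequence[i + 1] if i + 1 < len(sequence) else ""
        if aa in "KR" and nxt != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])  # C-terminal peptide
    return peptides


def mh_mass(peptide):
    """Monoisotopic m/z of the singly protonated peptide ion [M+H]+."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER + PROTON


def mass_set(sequence, min_mass=200.0):
    """Sorted [M+H]+ values above min_mass for all tryptic peptides of a sequence."""
    return sorted(m for m in map(mh_mass, tryptic_peptides(sequence)) if m > min_mass)


if __name__ == "__main__":
    print(tryptic_peptides("AKRPGK"))
    print([round(m, 4) for m in mass_set("AKRPGK")])
```

Each translated HA sequence would yield one such mass list, and the collection of lists forms the input to tree construction.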
Basis of the MassTree Algorithm
The MassTree algorithm (Lun et al. 2013) reads t mass sets (M1, M2, …, Mt) each containing monoisotopic mass values greater than 200 in order to remove all single amino acid residue masses. Each set contains m/z values that represent protonated peptide ions [M+H]+ detected in a mass spectrum following the proteolytic digestion of the protein. The
algorithm identifies mass values that are indistinguishable (i) (within a mass error of 5 ppm) across the sets and also detects pairs of values whose mass difference corresponds to a single amino acid substitution (s). The contribution of single amino acid substitutions to the distance score is weighted at one half of that of peptides that share a common mass (Lun et al. 2013; Swaminathan and Downard 2014). A distance matrix is then generated through pairwise comparison of mass values across all data sets in order to construct the mass trees. A relaxed neighbor joining (NJ) approach (Evans et al. 2006), a variant of the most common distance-based method, was employed using the Clearcut algorithm (Sheneman et al. 2006) (with default rooting) to assemble the trees, which were visualized using FigTree v1.4.3 (Rambaut) with midpoint rooting. NJ approaches (Saitou and Nei 1987) provide a rapid and reliable method of generating phylogenetic trees from large datasets and minimize the lengths (or evolutionary distances) along all branches. In the mass trees, the branch lengths reflect the ratio of the number of different mass values to the total within each set.
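The pairwise comparison described above can be sketched in Python as follows. The exact distance function of the MassTree algorithm is not reproduced in this section, so the formula used here, one minus the weighted fraction of matched masses, with the larger set size as the denominator, is an assumption made purely for illustration, as are the function names and the greedy matching order. Only the 5 ppm tolerance and the one-half weighting of single substitutions are taken from the text.

```python
# Standard monoisotopic residue masses (rounded to 5 d.p.); illustrative values.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
# All non-zero mass differences between two residues (Leu/Ile gives zero and is skipped).
DELTAS = sorted({round(abs(a - b), 5) for a in RESIDUE_MASS.values()
                 for b in RESIDUE_MASS.values() if abs(a - b) > 1e-6})


def tolerance(m1, m2, ppm):
    return ppm * 1e-6 * max(m1, m2)


def pair_distance(set_a, set_b, ppm=5.0):
    """Assumed distance: 1 - (i + 0.5 * s) / max(|A|, |B|)."""
    remaining_b = list(set_b)
    unmatched_a, identical, substituted = [], 0, 0
    for m in set_a:  # first pass: masses indistinguishable within the ppm window
        hit = next((x for x in remaining_b if abs(m - x) <= tolerance(m, x, ppm)), None)
        if hit is None:
            unmatched_a.append(m)
        else:
            identical += 1
            remaining_b.remove(hit)
    for m in unmatched_a:  # second pass: differences matching one substitution
        hit = next((x for x in remaining_b
                    if any(abs(abs(x - m) - d) <= tolerance(m, x, ppm) for d in DELTAS)),
                   None)
        if hit is not None:
            substituted += 1
            remaining_b.remove(hit)
    return 1.0 - (identical + 0.5 * substituted) / max(len(set_a), len(set_b))


def distance_matrix(mass_sets, ppm=5.0):
    """Symmetric matrix of pairwise distances between all mass sets."""
    t = len(mass_sets)
    matrix = [[0.0] * t for _ in range(t)]
    for i in range(t):
        for j in range(i + 1, t):
            matrix[i][j] = matrix[j][i] = pair_distance(mass_sets[i], mass_sets[j], ppm)
    return matrix


if __name__ == "__main__":
    a = [1000.0, 1500.0, 2000.0]
    b = [1000.0, 1514.01565, 2500.0]  # one identical mass, one CH2 shift, one unrelated
    print(distance_matrix([a, b]))
```

A matrix of this form would then be passed to a neighbor joining implementation such as Clearcut to assemble the tree.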
Mass tree branch node numbering and mutation annotation
The Clearcut algorithm was modified to add branch node numbering, where the root was placed between the two most distantly related groups/clades of the tree. The nodes in the tree were then numbered iteratively (n, n+1, etc.) based on their evolutionary distance from the root node (defined as n=1). The relaxed neighbor joining (NJ) approach allows for variation in evolutionary rate along the branches and, therefore, more than one node may be positioned at the same distance from the root. In such cases, branch nodes were assigned the same node number. The NJ method has been shown to be efficient in protein phylogeny studies even if the evolutionary rate among lineages differs (Hasegawa and Fujiwara 1993). Along the internal nodes, taxa were ordered based on their distance from the root. This order was
used to establish the direction of a mutation (either W to M, or M to W) where “W” represents the wild type residue, of a strain closer to the root, and “M” defines the mutated single amino acid in the second strain. The tree hierarchy was traversed starting from the pair of mass sets with the lowest distance score between them. The peptide mass (m/z) values common (within ±5ppm) to both sets were identified. The remaining masses were then compared to identify mass differences corresponding to possible single amino acid substitutions. A minimum nucleotide base difference (MNBD) criterion was applied to reduce the possible number of candidates, with amino acid substitutions associated with fewer nucleotide base changes favoured. On the rare occasion where a peptide mass of one set could be associated by a single amino acid substitution with two different peptides from the second set, the mass difference closest in value to the theoretical mutation mass value was favoured.
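The assignment of candidate substitutions from a mass difference, followed by the MNBD filter, can be sketched as below. This is an illustrative Python reading of the procedure rather than the MassTree source: the function names are hypothetical, the residue masses are standard monoisotopic values rounded to five decimal places, and the closest-mass tie-break for ambiguous peptide pairings is not implemented.

```python
from itertools import product

# Standard monoisotopic residue masses (rounded to 5 d.p.); illustrative values.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}

# Standard genetic code, codons enumerated in TCAG order ('*' marks stop codons).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = {}
for codon, aa in zip(("".join(c) for c in product(BASES, repeat=3)), AMINO):
    CODONS.setdefault(aa, []).append(codon)


def mnbd(wild, mutant):
    """Minimum number of base changes needed to convert a codon of one residue to the other."""
    return min(sum(x != y for x, y in zip(c1, c2))
               for c1 in CODONS[wild] for c2 in CODONS[mutant])


def candidate_substitutions(m_wild, m_mutant, ppm=5.0):
    """Substitutions (W, M) whose mass shift matches m_mutant - m_wild, keeping
    only those that require the fewest nucleotide changes."""
    delta = m_mutant - m_wild
    tol = ppm * 1e-6 * max(m_wild, m_mutant)
    hits = [(w, m) for w in RESIDUE_MASS for m in RESIDUE_MASS
            if w != m and abs((RESIDUE_MASS[m] - RESIDUE_MASS[w]) - delta) <= tol]
    if not hits:
        return []
    fewest = min(mnbd(w, m) for w, m in hits)
    return sorted(pair for pair in hits if mnbd(*pair) == fewest)


if __name__ == "__main__":
    # A +12.0364 shift fits T to I, T to L and S to V; only T to I needs one base change.
    print(candidate_substitutions(1000.0, 1012.03638))
    # Loss of a methylene group (-14.0157): five single-base candidates remain.
    print(candidate_substitutions(1014.01565, 1000.0))
```

In this sketch the direction W to M follows the order of the arguments, so the mass from the strain closer to the root is passed first.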
Calculation of weighted frequencies and mutation scores
Weighted frequencies for each substitution (e.g. W to M) were then computed employing equation (1):
$$wf(x) = \frac{F_x}{L} \qquad (1)$$
where $wf(x)$ denotes the weighted frequency of a single substitution $x$, $L$ is the order of the node and $F_x$ is the frequency of the single substitution $x$ at that node. Weighted frequencies for each substitution were summed across the tree and a mutation score was assigned according to equation (2):
$$S(x) = \sum_{n=1}^{N} wf_n(x) \qquad (2)$$
where $S(x)$ denotes the sum of all weighted frequencies of a single substitution $x$ over $N$ branch nodes of the mass tree, and $wf_n(x)$ is the weighted frequency at the $n$th node.
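The scoring step can be sketched in a few lines of Python. The sketch assumes that the weighted frequency at a node is the frequency divided by the node order, an assumption that reproduces the score of 1.026 reported in the Results for the F to S mutation occurring once at each of nodes 1, 76 and 80. The function name and the input format are hypothetical.

```python
from collections import defaultdict


def mutation_scores(observations):
    """observations: iterable of (substitution, node_order, frequency_at_node).

    Returns {substitution: (total_frequency, score)}, where the score sums
    frequency / node_order over all nodes (assumed form of the weighting).
    """
    frequency = defaultdict(int)
    score = defaultdict(float)
    for substitution, node_order, count in observations:
        frequency[substitution] += count
        score[substitution] += count / node_order  # nodes near the root weigh more
    return {s: (frequency[s], round(score[s], 3)) for s in frequency}


if __name__ == "__main__":
    observed = [("F to S", 1, 1), ("F to S", 76, 1), ("F to S", 80, 1)]
    print(mutation_scores(observed))  # {'F to S': (3, 1.026)}
```

Because the weight falls off as the reciprocal of the node order, a single occurrence at the root outweighs many occurrences near the leaves.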
Sequence tree construction
A sequence tree was constructed through multiple alignment of the HA protein sequences using the ClustalW algorithm (Larkin et al. 2007). Trees were built and visualised with FigTree v1.4.3 (Rambaut) with midpoint rooting.
Results and Discussion
Of the 20 common amino acid residues, only two (the isomers leucine and isoleucine) are indistinguishable by mass when high resolution, high mass accuracy mass spectrometry is employed. The substitution of one amino acid with another results in a change in the nominal mass of the peptide of between 0 (for leucine to isoleucine) and 129 (for glycine to tryptophan) (Wada et al. 1989). For the most part, the mass difference is unique to a particular substitution. For example, the substitution of an aspartic acid residue with histidine increases the mass of the peptide by 22.0320 units. The reverse substitution (histidine to aspartic acid) decreases the mass of the peptide by the same value. Only where substitutions result in the addition or subtraction of the same combination of atoms is the mass difference not diagnostic of a particular mutation. This occurs, for example, where alanine replaces glycine, or threonine replaces serine, since both involve the addition of a methylene (CH2) group corresponding to the addition of 14.0157 mass units. Based on differences in peptide mass alone, 64% of all the possible 380 (20 x 20 - 20) mutations can be distinguished. When preference is given to mutations associated with a single nucleotide substitution in the gene that encodes the protein, referred to as a minimum nucleotide base difference (MNBD) criterion, 82% of all mutations can be uniquely characterized by mass. The mutation annotated mass tree for the translated HA sequences of 83 human type A H3N2 strains is shown in Figure 1. These strains include those recommended by the WHO for seasonal influenza vaccines. The tree is colour coded to highlight clades that contain strains for the periods 1973-1983, 1982-1994, 1995-2002 and 2002-2015. The nodes across the tree are numbered according to their distance from the root, where the root node is assigned the number 1.
Along the branches of the tree, the identified mutations are shown together with the number of times each mutation occurs in brackets. In some
cases, several mutations are indistinguishable by mass and all possible combinations are shown. For example, at branch node 3, the mutation of tyrosine to serine (Y to S) occurs once, while the second mutation may correspond to one of several possible mutations (A to G, E to D, I to V, L to V, or T to S) which all correspond to the loss of a methylene CH2 group associated with a nominal decrease of 14 mass units as described above. In addition to the tree, the mass tree algorithm outputs the frequency, or number of times a mutation occurs across the whole tree, together with a mutation score. The latter is the sum of the frequency of a single substitution, weighted according to its position relative to the root of the tree. The file for the HA dataset (denoted dataset 1) is shown in Table 1. As can be seen from the table, the most frequent mutation is that associated with the second set of mutations described above. Such mutations occur on 15 occasions across the tree and their high frequency can in part be attributed to the five indistinguishable (by mass) possibilities. The single mutation of phenylalanine (F) to serine (S), however, is assigned a higher mutation score despite its lower frequency (3 times), since it occurs more often towards the root of the tree and thus is attributed greater evolutionary importance. This mutation can be seen to occur once at nodes 1, 76 and 80; the former is weighted particularly favourably. To establish whether the frequency of the mutations, and their scores, is unique to the specific set of sequences under analysis, or is common to other data for strains of an identical subtype, and thus of evolutionary significance, the mass tree for a second set of 83 translated HA sequences of type A H3N2 strains was generated. The mutation scores and frequencies for this set (dataset 2) are also shown in Table 1. 
By studying mutation trends, rather than specific frequency or mutation score values, that would be expected to vary somewhat from one dataset to another, many of the mutations are seen to have similar frequencies and mutation scores. For example, the mutation combination A to G, E to D, I to V, L to V, or T to S, occurs 15 and 13 times within the two
trees and has a high mutation score (top 4) relative to all others. The same is true for the reverse combination G to A, D to E, V to I, V to L, or S to T. A similar trend is observed for strains in which no single amino acid substitution was found, as well as those with the single mutation K to R, or the mutation pair of G to S or A to T. The mutation G to D or A to E, both associated with the addition of a CHCO2H group to the side chain corresponding to a nominal mass increase of 58 units, occurs twice in each tree and has a similar rank and mutation score in both. The same is true for the mutation of E to G. The lowest ranked mutation of the second dataset (Q to H) is not found at all in the first mass tree (Figure 1 and Table 1). In some cases, such as for the mutation of F to S, although the frequency of the mutation remains the same in both trees (3 times in this case), the mutation scores differ considerably (1.026 versus 0.114) due to the position of these mutations relative to others on the two trees. Relatively few mutations appear with high frequency and with a high mutation score in one dataset but not in the other. Examples include the mutation Y to S, which occurs 3 times in the first dataset, with a mutation score of 0.442, but is not detected at all in the second. The mutation of G to E occurs 7 times in the first dataset but not at all in the second. The mutation of V to M occurs twice in the first dataset with a relatively high mutation score, yet occurs only once in the second dataset with a low mutation score and ranking. For the most part, therefore, the mutation profile is independent of the particular dataset and is diagnostic of the evolution of type A H3N2 strains of the virus, allowing the data to be further interrogated for this purpose. To investigate the hemagglutinin mutation profile for viruses of a different type, mass trees were generated from two sets of sequences for type B strains of the Victoria and Yamagata lineage.
The mutation annotated mass trees are shown, side-by-side, in Figure 2 and the mutation scores and frequencies are listed together in Table 2. Much greater disparity exists between these two datasets. While similar rankings and/or
frequencies are evident for some mutations, such as N to D or Q to E, and for strains in which no mutation was found, the scores and frequencies for the majority of mutations differ appreciably across the two datasets. These observations are consistent with the different evolutionary histories of the Victoria and Yamagata strains following the divergence of the lineages in, or around, 1987. To investigate the reliability of identifying mutations based on mass data alone, a sequence tree was generated for the same set of full length HA sequences used to construct the mass tree shown in Figure 1. This partially annotated sequence tree is shown in Figure 3. It is important to note that, consistent with previous studies (Lun et al. 2013; Swaminathan and Downard 2014), the two trees (Figures 1 and 3) share similar topologies, in which strains share similar positions along the branches and are sorted (as coloured) according to the year in which they first appeared. Mutations identified based on a sequence comparison of 21 neighbouring strains are shown at a range of leaf positions down the tree. These are listed in Table 3, together with their node number, alongside the mutations identified for the same pair of sequences/strains within the mass tree. The vast majority of mutations identified on the mass tree are identical to those from the sequence tree (see column 5 of Table 3). Mutations with a common mass are boxed, with those actually identified from the sequence data shown in bold. Those that were not identified, or which differed from one tree to the other, are underlined. As described previously, trees built from masses are unable to identify the substitution of isomeric leucine and isoleucine, since these residues share an identical mass. Consequently, such mutations identified at nodes 53, 70 and 72 in the sequence tree were not detected in the mass tree at the comparable nodes. At node 39 in the mass tree, the mutation of T to I was assigned in place of the mutation of S to V found in the sequence tree (node 81), since they share the same mass difference (+12.0364).
The former is associated with a single nucleotide base substitution and is therefore favoured by the MassTree algorithm. A number of mutations in the sequence tree involving the addition or removal of a lysine or arginine residue are also not detected in the mass tree. This is because such mutations introduce or remove a trypsin cleavage site in the protein, resulting in the generation of two peptides where there was only one, or vice versa. This is evident in regions of the sequences compared comprising residues 157-176. The upper sequence contains an additional lysine residue at position 159 while the lower sequence contains an additional lysine at position 172 (Figure 3). Such substitutions result in a change to the mass values for peptides that contain or neighbour these residues. Another scenario in which the mass and sequence trees may differ involves the presence of two mutations within one peptide segment. These cannot currently be interrogated by the MassTree algorithm. Node 49 of the sequence tree contains two mutations, N to D and E to V. Together these result in a peptide mass decrease of 28.9902 (the sum of +0.9840 and -29.9742). This mass difference is identical to that resulting from the mutation of Q to V, found at the corresponding node position (numbered 77) in the mass tree. Both are associated with the net loss of one atom each of nitrogen and oxygen and the addition of one atom of hydrogen. One final situation, although rare, may result in differences in the mutations identified in the mass versus sequence trees. This occurs when the mass of a peptide resulting from a mutation coincidentally matches, or differs by the mass of a single substitution from, that of another peptide segment of very different sequence. At node 56 of the mass tree the G to T mutation is identified, yet it was not found at the corresponding node (number 36) of the sequence tree.
In this instance the mutations G to D and N to K at neighbouring residues 160 and 161, result in the addition of a tryptic cleavage site at the latter, and the generation of a new tryptic peptide of mass 406.1932 (Figure 3). This peptide’s mass differs in value by 44.0262 mass units from a
small tryptic peptide segment at residues 173-176 (of mass 450.2194) of the second sequence. The mass difference of 44.0262 coincidentally corresponds to that associated with the mutation of a glycine to a threonine residue (101.0477 - 57.0215). Despite some differences in the mutations identified in the mass and sequence trees based on a comparison of random pairs of strains at the leaf nodes (shown in Figure 3), the vast majority, 56 of a total of 73 (or 77%) sampled in Table 3, are identical. This demonstrates that a mass based phylogenetic approach to identify and interrogate mutations associated with the evolution of a protein, and the organism from which it is derived, is both valid and reliable. Using the mass tree approach, mutations are identified and viewed “globally”, from the perspective of the whole expressed protein. This can allow functionally-linked mutations, or even compensatory mutations when two separate trees are compared, to be interrogated where they are located remotely in the sequence(s) or structure(s). Where the specific location of a mutation is required, the sequence of any particular peptide can be determined by tandem mass spectrometry from the same protein digest sample used to generate the mass map input data.
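The two coincidences discussed above can be checked with simple arithmetic. The short Python sketch below uses standard monoisotopic residue masses rounded to five decimal places, together with the two peptide masses quoted in the text. The helper name `shift` is hypothetical.

```python
# Standard monoisotopic residue masses (rounded to 5 d.p.) for the residues involved.
MASS = {
    "G": 57.02146, "V": 99.06841, "T": 101.04768, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "E": 129.04259,
}


def shift(wild, mutant):
    """Change in peptide mass when residue `wild` is replaced by `mutant`."""
    return MASS[mutant] - MASS[wild]


# Two mutations within one peptide (N to D and E to V) versus the single Q to V.
double = shift("N", "D") + shift("E", "V")
single = shift("Q", "V")

# Difference between two unrelated tryptic peptides versus the single G to T.
peptide_gap = 450.2194 - 406.1932
g_to_t = shift("G", "T")

if __name__ == "__main__":
    print(f"N to D plus E to V: {double:+.4f}   Q to V: {single:+.4f}")
    print(f"peptide gap: {peptide_gap:+.4f}   G to T: {g_to_t:+.4f}")
```

In both cases the paired values agree to well within the mass tolerance used for matching, which is why mass data alone cannot separate these alternatives.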
Conclusions
The incorporation of mutational analysis into a mass-based phylogenetic approach has been demonstrated to be a broadly applicable, reliable and robust method that avoids the need to generate or employ gene or protein sequences to study the evolution of organisms. The generation of peptide mass maps (or fingerprints), used by the algorithm as input data, is common to protein identification in proteomics applications where mass spectrometry is a central technology. This provides a convenient means to rapidly generate experimental datasets to be used alone, or in conjunction with theoretical mass data produced from databases of translated gene or protein sequences as shown here. The ability to display amino acid mutations on mass trees, and to measure their frequency and significance in terms of protein evolution, is expected to aid studies that seek to better predict future evolutionary trends. This is of particular importance for the generation of efficacious vaccines ahead of seasonal outbreaks in the case of the influenza virus. In terms of this application, separate algorithms have been developed (Lun et al. 2012) which enable peptides from component proteins to be differentiated within mass maps of whole virus digests, thus overcoming the need to separate such proteins prior to analysis. The improved mass tree algorithm allows the evolutionary history of organisms to be reconstructed using a new methodology and new (non-sequence) data sets that should aid in our understanding of the processes of molecular evolution.
Acknowledgements
This work was supported in part by an Australian Research Council Discovery Project grant (DP120101167) awarded to K. Downard.
Figure Legends
Figure 1. Mutation annotated mass tree for the hemagglutinin sequences of 83 human type A H3N2 strains
Figure 2. Mutation annotated mass trees for the hemagglutinin sequences of human type B strains of the Victoria and Yamagata lineage
Figure 3. Partially annotated tree showing mutations identified from neighbouring leaf node strains for the hemagglutinin sequences of 83 human type A H3N2 strains
Figure 4. Segments of two HA sequences showing the effect of mutations involving the addition or loss of lysine residues on the mass of neighbouring peptides
Table Legends
Table 1. Mutations, score and frequency for the mutations identified in the mass tree of Figure 1 (dataset 1) and of a second tree (dataset 2) comprising a different 83 translated hemagglutinin sequences of human type A H3N2 strains
Table 2. Mutations, score and frequency for the mutations identified in mass trees comprising two different sets of translated hemagglutinin sequences of human type B strains of the Victoria lineage
Table 3. Comparison of mutations identified from the mass and sequence trees from a common set of translated hemagglutinin sequences of human type A H3N2 strains
References
Arvia, R., Corcioli, F., Pierucci, F., Azzi, A. 2014. Molecular Markers of Influenza B Lineages and Clades. Viruses 6, 4437-4446.
Balaji, S., Srinivasan, N. 2007. Comparison of sequence-based and structure-based phylogenetic trees of homologous proteins: Inferences on protein evolution. J. Biosci. 32, 83-96.
Boyce, K., Sievers, G., Higgins, D.G. 2014. Simple chained guide trees give high-quality protein multiple sequence alignments. Proc. Natl. Acad. Sci. U.S.A. 111, 10556-10561.
Brocchieri, L. 2001. Phylogenetic Inferences from Molecular Sequences: Review and Critique. Theor. Pop. Biol. 59, 27-40.
Burke, D.F., Smith, D.J. 2014. A Recommended Numbering Scheme for Influenza A HA Subtypes. PLoS One 9, e112302.
Cottrell, J.S. 2011. Protein Identification using MS/MS Data. J. Proteomics 74, 1842-1851.
Downard, K.M. 2006. Ions of the Interactome: The Role of MS in the Study of Protein Interactions in Proteomics and Structural Biology. Proteomics 6, 5374-5384.
Downard, K.M. (Ed.) 2007. Mass Spectrometry of Protein Interactions. John Wiley & Sons, New Jersey, USA.
Evans, J., Sheneman, L., Foster, J.A. 2006. Relaxed Neighbor Joining: A Fast Distance-based Phylogenetic Tree Construction Method. J. Mol. Evol. 62, 785-792.
Fernandes, N.D., Downard, K.M. 2014. Incorporation of a Proteotyping Approach using Mass Spectrometry for the Surveillance of the Influenza Virus in Cell Culture. J. Clin. Microbio. 52, 725-735.
Gabaldón, T. 2005. Evolution of proteins and proteomes: a phylogenetics approach. Evol. Bioinform. Online 1, 51-61.
Gregory, T.R. 2008. Understanding Evolutionary Trees. Evol. Educ. Outreach 1, 121-137.
Gupta, R.S. 1998. Protein phylogenies and signature sequences: A reappraisal of evolutionary relationships among archaebacteria, eubacteria, and eukaryotes. Microbiol. Mol. Biol. Rev. 62, 1435-1491.
Hasegawa, M., Fujiwara, M. 1993. Relative efficiencies of the maximum likelihood, maximum parsimony, and neighbor-joining methods for estimating protein phylogeny. Mol. Phylogenet. Evol. 2, 1-5.
Larkin, M.A., Blackshields, G., Brown, N.P., Chenna, R., McGettigan, P.A., McWilliam, H., Valentin, F., Wallace, I.M., Wilm, A., Lopez, R., Thompson, J.D., Gibson, T.J., Higgins, D.G. 2007. Clustal W and Clustal X version 2.0. Bioinformatics 23, 2947-2948.
Lun, A.T.L., Wong, J.W.H., Downard, K.M. 2012. FluShuffle and FluResort - New Algorithms to Identify Reassorted Strains of the Influenza Virus by Mass Spectrometry. BMC Bioinformatics 13, 208.
Lun, A.T.L., Swaminathan, K., Wong, J.W.H., Downard, K.M. 2013. Mass Trees - A New Phylogenetic Approach and Algorithm to Chart Evolutionary History with Mass Spectrometry. Anal. Chem. 85, 5475-5482.
Matsuda, H., Yamashita, H., Kaneda, Y. 1994. Molecular Phylogenetic Analysis using both DNA and Amino Acid Sequence Data and Its Parallelization. Genome Informatics, No. 5, Universal Academy Press Inc., Tokyo, pp. 120-129.
O'Meara, B.C. 2012. Evolutionary Inferences from Phylogenies: A Review of Methods. Annu. Rev. Ecol. Evol. Syst. 43, 267-285.
Opperdoes, F.R. 2009. Phylogenetic analysis using protein sequences, in The Phylogenetic Handbook: A Practical Approach to Phylogenetic Analysis and Hypothesis Testing, Eds. Salemi, M., Lemey, P., Vandamme, A-K., 2nd Edn. Cambridge University Press, Cambridge, UK.
Rambaut, A. FigTree v1.4.3. http://tree.bio.ed.ac.uk/software/figtree/
Rokas, A. 2011. Phylogenetic analysis of protein sequence data using the Randomized Axelerated Maximum Likelihood (RAXML) Program. Curr. Protoc. Mol. Biol. Ch. 19, unit 19.11.
Saitou, N., Nei, M. 1987. The Neighbor-Joining Method: A New Method for Reconstructing Phylogenetic Trees. Mol. Biol. Evol. 4, 406-425.
Sheneman, L., Evans, J., Foster, J.A. 2006. Clearcut: A Fast Implementation of Relaxed Neighbor Joining. Bioinformatics 22, 2823-2834.
Suzuki, Y., Nei, M. 2002. Origin and evolution of influenza virus hemagglutinin genes. Mol. Biol. Evol. 19, 501-509.
Swaminathan, K., Downard, K.M. 2014. Evolution of Influenza Neuraminidase and the Detection of Antiviral Resistant Strains Using Mass Trees. Anal. Chem. 86, 629-637.
Todd, A.E., Orengo, C.A., Thornton, J.M. 1999. Evolution of protein function, from a structural perspective. Curr. Opin. Chem. Biol. 3, 548-556.
Tosh, P.K., Jacobson, R.M., Poland, G.A. 2010. Influenza Vaccines: From Surveillance Through Production to Protection. Mayo Clin. Proc. 85, 257-273.
Wada, T., Matsuo, T., Sakurai, T. 1989. Structure Elucidation of Hemoglobin Variants and other Proteins by Digit-Printing Method. Mass Spectrom. Rev. 8, 379-434.
Wilkins, M.R., Lindskog, I., Gasteiger, E., Bairoch, A., Sanchez, J-C., Hochstrasser, D.F., Appel, R.D. 1997. Detailed Peptide Characterisation using PeptideMass - a World-Wide Web Accessible Tool. Electrophoresis 18, 403-408.
Wui, G., Yan, S-M. 2006. Mutation Trend of Hemagglutinin of Influenza A Virus: A Review from a Computational Mutation Viewpoint. Acta Pharmacol. Sinica 27, 513-526.
Yang, S., Valas, R., Bourne, P.E. 2009. Evolution Studied Using Protein Structure, in Structural Bioinformatics (Methods of Biochemical Analysis), Eds. Gu, J., Bourne, P. Wiley & Sons Inc., Hoboken NJ, U.S.A. Ch. 23, pp. 559-571.
Yang, Z., Rannala, B. 2012. Molecular phylogenetics: principles and practice. Nature Reviews Genetics 13, 303-314.
Zhang, G., Annan, R.S., Carr, S.A., Neubert, T.A. 2014. Overview of Peptide and Protein Analysis by Mass Spectrometry. In: Current Protocols in Molecular Biology, John Wiley & Sons, New Jersey, p. 10.21.1-10.21.30.
22
Figure 1
Figure 2 (panels: Type B Victoria, Type B Yamagata)
Figure 3
Figure 4
        150                        180
HK80    GSYACKRGSDKSFFSRLNWLYESESKYPAL
NAN83   GSYACKRGSGNSFFSRLNWLYKSESKYPAL
Peptide masses (HK80): 406.1932 (GSDK), 643.3198 (SFFSR)
Peptide masses (NAN83): 958.4377 (GSGNSFFSR), 450.2195 (SESK)
450.2195 - 406.1932 ≈ 44.0262 ≡ ΔGT
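The peptide masses in Figure 4 can be checked from the two sequence segments. The sketch below is illustrative only and is not part of MassTree: it assumes standard monoisotopic residue masses, unmodified cysteine, cleavage after every K and R with no missed cleavages, and singly protonated ions. The names `tryptic_peptides` and `mh_plus` are hypothetical helpers introduced for this example.

```python
# Monoisotopic residue masses in Da (standard values). Only residues that
# occur in the two segments are listed, plus T for the G -> T comparison.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "N": 114.04293,
    "D": 115.02694, "K": 128.09496, "E": 129.04259, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # one water per peptide (free N- and C-termini)
PROTON = 1.00728   # singly protonated ion, [M+H]+


def tryptic_peptides(sequence):
    """Cleave after every K or R (assumption: no missed cleavages)."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        if residue in "KR":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        # Trailing fragment of the excerpt, not a complete tryptic peptide.
        peptides.append(sequence[start:])
    return peptides


def mh_plus(peptide):
    """Monoisotopic [M+H]+ of an unmodified peptide."""
    return sum(RESIDUE_MASS[r] for r in peptide) + WATER + PROTON


HK80 = "GSYACKRGSDKSFFSRLNWLYESESKYPAL"
NAN83 = "GSYACKRGSGNSFFSRLNWLYKSESKYPAL"

hk80_peptides = tryptic_peptides(HK80)
nan83_peptides = tryptic_peptides(NAN83)

# Peptides that distinguish the two strains in this segment.
hk80_only = [p for p in hk80_peptides if p not in nan83_peptides]
nan83_only = [p for p in nan83_peptides if p not in hk80_peptides]

for name, peptides in (("HK80", hk80_only), ("NAN83", nan83_only)):
    for p in peptides:
        print(f"{name:6s} {p:12s} {mh_plus(p):.4f}")

# Mass shift expected for a Gly -> Thr substitution.
DELTA_GT = RESIDUE_MASS["T"] - RESIDUE_MASS["G"]
print(f"SESK - GSDK = {mh_plus('SESK') - mh_plus('GSDK'):.4f}")
print(f"delta(G->T) = {DELTA_GT:.4f}")
```

Under these assumptions GSDK (HK80) and SESK (NAN83) differ by the mass of a G to T substitution although the segment contains no such substitution, which is the kind of coincidental mass pairing a purely mass-based comparison can produce.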
Table 1
mutation (dataset 1) | score | frequency | mutation (dataset 2) | score | frequency
FS                   | 1.026 | 3         | No_mutation          | 2.300 | 9
AG;ED;IV;LV;TS       | 0.897 | 15        | SG;TA                | 0.746 | 7
No_mutation          | 0.829 | 13        | DN;EQ                | 0.578 | 12
KR                   | 0.709 | 12        | AG;ED;IV;LV;TS       | 0.409 | 13
NA                   | 0.500 | 1         | GA;DE;VI;VL;ST       | 0.369 | 13
YS                   | 0.442 | 3         | KR                   | 0.357 | 8
IT                   | 0.404 | 6         | PL                   | 0.333 | 1
GA;DE;VI;VL;ST       | 0.382 | 12        | NT                   | 0.333 | 1
DN;EQ                | 0.373 | 6         | SA;YF                | 0.294 | 4
RK                   | 0.360 | 4         | SN                   | 0.283 | 6
GS;AT                | 0.346 | 7         | TI                   | 0.223 | 3
SN                   | 0.291 | 7         | GN;AQ                | 0.218 | 5
RG                   | 0.259 | 2         | GS;AT                | 0.200 | 5
GE                   | 0.239 | 7         | NS                   | 0.185 | 4
VM                   | 0.236 | 2         | ND;QE                | 0.179 | 4
NS                   | 0.227 | 3         | RK                   | 0.174 | 7
SC                   | 0.167 | 1         | PS                   | 0.16  | 2
ND;QE                | 0.163 | 7         | HY                   | 0.151 | 3
AV                   | 0.147 | 2         | VA                   | 0.143 | 1
GV                   | 0.143 | 1         | HD                   | 0.143 | 1
HY                   | 0.140 | 4         | IF;LF                | 0.125 | 3
AS;FY                | 0.139 | 2         | SF                   | 0.120 | 3
DG;EA                | 0.104 | 2         | FS                   | 0.114 | 3
SP                   | 0.104 | 2         | NG;QA                | 0.103 | 2
SE                   | 0.100 | 1         | AS;FY                | 0.096 | 2
PQ                   | 0.089 | 2         | QL                   | 0.094 | 2
PS                   | 0.084 | 3         | IS;LS                | 0.086 | 2
SI;SL                | 0.081 | 4         | YH                   | 0.077 | 1
VA                   | 0.074 | 3         | GW                   | 0.077 | 1
HQ                   | 0.073 | 2         | SP                   | 0.076 | 3
YI;YL                | 0.062 | 1         | NI                   | 0.067 | 1
NT                   | 0.062 | 1         | TN                   | 0.064 | 4
YH                   | 0.061 | 3         | IT                   | 0.059 | 2
IM;LM                | 0.058 | 2         | SY                   | 0.057 | 3
EG                   | 0.057 | 2         | EG                   | 0.043 | 2
GD;AE                | 0.055 | 2         | GD;AE                | 0.043 | 2
LQ                   | 0.053 | 1         | FQ                   | 0.040 | 1
TN                   | 0.048 | 3         | IM;LM                | 0.040 | 2
AP                   | 0.048 | 1         | LH                   | 0.036 | 1
KQ                   | 0.045 | 1         | HQ                   | 0.030 | 1
NG;QA                | 0.039 | 2         | SC                   | 0.030 | 2
TI                   | 0.038 | 2         | SI;SL                | 0.029 | 2
SA;YF                | 0.037 | 2         | QG                   | 0.025 | 1
IF;LF                | 0.036 | 2         | QR                   | 0.024 | 1
FI;FL                | 0.033 | 2         | VM                   | 0.024 | 1
SD;TE                | 0.031 | 1         | VD                   | 0.023 | 1
FV                   | 0.030 | 1         | GV                   | 0.021 | 1
QL                   | 0.023 | 1         | MI;ML                | 0.021 | 1
RV                   | 0.020 | 1         | HN                   | 0.018 | 1
SG;TA                | 0.018 | 1         | DG;EA                | 0.017 | 1
GT                   | 0.018 | 1         | LQ                   | 0.016 | 1
WM                   | 0.017 | 1         | AP                   | 0.016 | 1
IS;LS                | 0.014 | 1         | AD                   | 0.014 | 1
LH                   | 0.014 | 1         | PQ                   | 0.013 | 1
QV                   | 0.013 | 1         | QH                   | 0.013 | 1
LP                   | 0.013 | 1         |                      |       |
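Several entries in the tables are written as semicolon-separated sets such as AG;ED;IV;LV;TS. Each member of such a set produces the same change in peptide mass, so a mass measurement alone cannot tell them apart. The sketch below illustrates this under stated assumptions: standard monoisotopic residue masses, a matching tolerance of 0.001 Da chosen here for illustration, and hypothetical helper names (`mass_shift`, `isobaric_partners`) that are not taken from MassTree.

```python
from itertools import permutations

# Monoisotopic residue masses in Da (standard values) for the 20 amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def mass_shift(substitution):
    """Mass change (Da) for a two-letter substitution such as 'AG' (A -> G)."""
    old, new = substitution
    return RESIDUE_MASS[new] - RESIDUE_MASS[old]


def isobaric_partners(substitution, tolerance=0.001):
    """All substitutions whose mass shift matches within the tolerance.

    The tolerance is an assumption of this sketch, not a MassTree setting.
    """
    target = mass_shift(substitution)
    return sorted(
        old + new
        for old, new in permutations(RESIDUE_MASS, 2)
        if abs(mass_shift(old + new) - target) < tolerance
    )


for example in ("AG", "GS", "TI"):
    print(example, f"{mass_shift(example):+.4f}", isobaric_partners(example))
```

With these masses the partners of A to G include every member of the AG;ED;IV;LV;TS set, and the partners of G to S are exactly GS and AT. The mass-only grouping also returns Q to N alongside A to G, and S to V alongside T to I, neither of which appears in the corresponding table labels, so the labels presumably reflect further selection criteria that this sketch does not attempt to reproduce.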
Table 2
mutation (type B Vic) | score | frequency | mutation (type B Yama) | score | frequency
KR                    | 1.133 | 4         | No_mutation            | 1.586 | 17
IG;LG                 | 1     | 1         | AG;ED;IV;LV;TS         | 1.268 | 6
DT                    | 1     | 1         | MI;ML                  | 1.099 | 3
No_mutation           | 0.768 | 16        | RK                     | 0.908 | 8
TI                    | 0.702 | 3         | YH                     | 0.75  | 2
DV                    | 0.5   | 1         | EY                     | 0.500 | 1
ND;QE                 | 0.478 | 6         | ND;QE                  | 0.300 | 9
AG;ED;IV;LV;TS        | 0.415 | 9         | TI                     | 0.240 | 3
GA;DE;VI;VL;ST        | 0.413 | 8         | DN;EQ                  | 0.216 | 2
EG                    | 0.37  | 2         | RI;RL                  | 0.200 | 1
DG;EA                 | 0.333 | 1         | GA;DE;VI;VL;ST         | 0.185 | 6
DN;EQ                 | 0.273 | 5         | PA                     | 0.167 | 1
IF;LF                 | 0.229 | 2         | SG;TA                  | 0.161 | 4
GS;AT                 | 0.217 | 2         | KR                     | 0.153 | 3
RG                    | 0.2   | 1         | SP                     | 0.143 | 1
MT                    | 0.167 | 1         | NS                     | 0.115 | 3
CT                    | 0.125 | 1         | SN                     | 0.091 | 3
DQ                    | 0.111 | 1         | GD;AE                  | 0.079 | 3
DS;ET                 | 0.091 | 1         | TN                     | 0.077 | 3
AV                    | 0.09  | 3         | DG;EA                  | 0.077 | 1
SG;TA                 | 0.087 | 2         | VT                     | 0.062 | 1
VA                    | 0.081 | 2         | VA                     | 0.054 | 2
RK                    | 0.072 | 4         | QP                     | 0.043 | 1
SA;YF                 | 0.071 | 1         | NH                     | 0.041 | 2
GE                    | 0.067 | 1         | AS;FY                  | 0.037 | 1
VM                    | 0.053 | 1         | GS;AT                  | 0.034 | 2
TN                    | 0.045 | 2         | EG                     | 0.032 | 1
SE                    | 0.043 | 1         | DS;ET                  | 0.027 | 1
KI                    | 0.043 | 1         | QH                     | 0.027 | 1
LP                    | 0.042 | 1         | DT                     | 0.026 | 1
VT                    | 0.041 | 2         | MV                     | 0.022 | 1
YH                    | 0.037 | 1         | IN                     | 0.022 | 1
AD                    | 0.032 | 1         | GE                     | 0.018 | 1
NS                    | 0.031 | 1         | IT                     | 0.016 | 1
SF                    | 0.025 | 1         | FI;FL                  | 0.015 | 1
KN                    | 0.023 | 1         | RG                     | 0.015 | 1
TD                    | 0.02  | 1         | IF;LF                  | 0.015 | 1
HN                    | 0.018 | 1         |                        |       |
NI                    | 0.018 | 1         |                        |       |
FS                    | 0.018 | 1         |                        |       |
FI;FL                 | 0.017 | 1         |                        |       |
Table 3
node (mass tree) | mutations identified from mass tree | node (sequence tree) | mutations identified from sequence tree | no. in common of total | discrepancy explanation
71 | GS;AT(1), ND;QE(1), NS(1), SN(1) | 53 | AT, ND, NS, SN, IL | 4/5 | I to L
70 | GA;DE;VI;VL;ST(1), YH(1), AG;ED;IV;LV;TS(1), IS;LS(1) | 74 | GA, YH, IV, IS | 4/4 |
68 | GS;AT(1), SN(1), VA(1), TP(1) | 54 | AT, SN, VA, TP | 4/4 |
60 | IM;LM(1), IF;LF(1), KR(1) | 59 | IM, LF, KR, GR | 3/4 | add R
65 | FI;FL(1) | 80 | FL | 1/1 |
56 | GT(1), PQ(1), AG;ED;IV;LV;TS(1), SG;TA(1), FI;FL(1) | 36 | PQ, IV, TA, FL, GD, NK, KE | 4/7 | coincidental mass pairing + add or remove K
42 | GS;AT(1), SI;SL(1), KR(1) | 73 | AT, SI, KR | 3/3 |
39 | HY(1), TI(1) | 81 | HY, SV | 1/2 | MNDB favoured
37 | AG;ED;IV;LV;TS(1) | 82 | IV, SK | 1/2 | add K
32 | SD;TE(1), IT(1) | 64 | EK, SD, IT | 2/3 | add K
43 | GA;DE;VI;VL;ST(1), HY(1), QL(1), RK(1), DN;EQ(1) | 72 | VI, HY, QL, RK, DN, LI, IL | 5/7 | I to L and L to I
19 | LQ(1) | 60 | LQ | 1/1 |
14 | HY(1), PQ(1), AG;ED;IV;LV;TS(1) | 70 | HY, PQ, IV, IL, SR | 3/5 | I to L + add R
13 | GA;DE;VI;VL;ST(1), GE(2) | 71 | VI, GE | 2/2 |
21 | AP(1), GA;DE;VI;VL;ST(1) | 79 | AP, VI | 2/2 |
63 | GS;AT(1), SI;SL(1) | 76 | AT, SL | 2/2 |
66 | IT(1), KR(1) | 77 | IT, KR, RQ | 2/3 | remove R
72 | SI;SL(1), GE(1), LH(1), PS(1) | 34 | SI, GE, LH, PS | 4/4 |
77 | GE(1), QV(1) | 49 | GE, ND, EV, KT | 1/4 | two mutations + remove K
79 | SN(1), SP(1), TN(1) | 58 | SN, SP, TN | 3/3 |
80 | FS(1), LP(1), NS(1), TI(1) | 46 | FS, LP, NS, TI, TK | 4/5 | add K
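The "no. in common of total" entries of Table 3 can be recomputed from the two mutation columns. The sketch below assumes one particular reading of the table, which the source does not state explicitly: a sequence-tree mutation counts as found when it is one of the alternatives in any mass-tree label at the paired node, and the number in parentheses after a label is a count that can be ignored for matching. The function names are hypothetical.

```python
import re


def parse_mass_tree_cell(cell):
    """Split a cell such as 'GS;AT(1), NS(1)' into labels without counts."""
    labels = []
    for item in cell.split(","):
        item = item.strip()
        if item:
            # Drop a trailing count such as '(1)' or '(2)'.
            labels.append(re.sub(r"\(\d+\)$", "", item))
    return labels


def count_common(mass_tree_cell, sequence_cell):
    """Return (matched, total) for the sequence-tree mutations at one node."""
    candidates = set()
    for label in parse_mass_tree_cell(mass_tree_cell):
        # 'GS;AT' stands for the alternatives G->S and A->T.
        candidates.update(label.split(";"))
    sequence_mutations = [m.strip() for m in sequence_cell.split(",") if m.strip()]
    matched = sum(1 for m in sequence_mutations if m in candidates)
    return matched, len(sequence_mutations)


# Mass-tree node 71 paired with sequence-tree node 53.
print(count_common("GS;AT(1), ND;QE(1), NS(1), SN(1)", "AT, ND, NS, SN, IL"))
# Mass-tree node 56 paired with sequence-tree node 36.
print(count_common(
    "GT(1), PQ(1), AG;ED;IV;LV;TS(1), SG;TA(1), FI;FL(1)",
    "PQ, IV, TA, FL, GD, NK, KE",
))
```

Under this reading the two calls return (4, 5) and (4, 7), in agreement with the tabulated 4/5 and 4/7. The unmatched sequence-tree mutations are the ones addressed in the discrepancy column, for example I to L, which causes no change in mass.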
Highlights
• Development of a mass tree approach to identify and display protein mutations
• A phylogenetics methodology that avoids the need for gene or protein sequencing
• Application to a study of the evolution of influenza hemagglutinin
Graphical abstract