Computational Biology and Chemistry 61 (2016) 245–250
A computational model for predicting fusion peptide of retroviruses

Sijia Wu a, Jiuqiang Han a,*, Ruiling Liu a, Jun Liu b, Hongqiang Lv a

a School of Electronic and Information Engineering, Xi’an Jiaotong University, Xi’an, China
b School of Electrical Engineering, Xi’an Jiaotong University, Xi’an, China
ARTICLE INFO

Article history: Received 30 June 2015; received in revised form 27 January 2016; accepted 10 February 2016; available online 2 March 2016

Keywords: Fusion peptide domain prediction Hidden Markov Method Similarity comparison

ABSTRACT

As a pivotal domain within the envelope protein, the fusion peptide (FP) plays a crucial role in pathogenicity and therapeutic intervention. Given the limited FP annotations in the NCBI database and the absence of FP prediction software, a bioinformatics tool for predicting new putative FPs (np-FPs) in retroviruses is urgently needed. In this work, a sequence-based FP model was proposed by combining the Hidden Markov Method with similarity comparison. The classification accuracies are 91.97% and 92.31% for 10-fold and leave-one-out cross-validation, respectively. After scanning sequences without FP annotations, this model discovered 53,946 np-FPs. The statistical results on FPs and np-FPs reveal that FP is a conserved and hydrophobic domain. The FP software, programmed for the Windows environment, is available at https://sourceforge.net/projects/fptool/files/?source=navbar. © 2016 Elsevier Ltd. All rights reserved.
1. Introduction

Retroviruses are enveloped RNA-containing viruses including human endogenous retrovirus (HERV), human immunodeficiency virus (HIV), simian immunodeficiency virus (SIV), human T-cell lymphotropic virus (HTLV) and murine leukemia virus (MLV) (Coffin, 1992; Rosenberg, 2010). These viruses require membrane fusion to enter the host cell cytoplasm for reverse transcription (Sieczkarski and Whittaker, 2004). The fusion process is controlled and executed by the viral envelope glycoprotein (env) (White et al., 2008a). Env is composed of a surface subunit and a trans-membrane (TM) subunit, which mediate receptor binding and virus-cell fusion, respectively (Barnard et al., 2006). Located at the N-terminus of the TM subunit, the fusion peptide (FP) is an absolute requirement for the fusogenic function of retroviruses (Apellániz et al., 2014). Upon fusion activation, FP must insert itself obliquely into the target cell membrane to locally disorganize the structure of the lipid bilayer (Epand, 2003). The interaction of FP with the target cell leads to the formation of an intermediate pre-hairpin structure, which bridges and fuses the viral and host membranes (Apellániz et al., 2014).

Compared with targeting later steps in the retrovirus life cycle, such as reverse transcription, integration and maturation of virions, targeting membrane fusion to block retrovirus entry into host cells has significant advantages for therapeutic intervention (Wolf-Georg et al., 2010). For example, it helps reduce the risk of undesired side effects and prevent the establishment or maintenance of latent viral reservoirs (Wolf-Georg et al., 2010). Consequently, the dependence of retroviruses on FP during infection and the advantages of fusion inhibitors make FP an accessible and promising drug target (Wolf-Georg et al., 2010; Münch et al., 2007).

However, classical database search tools are inefficient at retrieving FPs in env sequences (Table 1), and the annotated FPs in online databases are not sufficient for a better understanding of FP. Thus, it is of great importance to propose a computational model to predict new putative FPs (np-FPs). The proposed FP model helps accelerate the identification of FPs in retroviruses by reducing the sequence dataset that requires corroboration by biochemical experiments.

The remainder of this paper is arranged as follows. Section 2 describes the four procedures considered for the sequence-based FP model (Chou, 2011): dataset construction, protein sample representation, the FP prediction algorithm and the evaluation methods. Section 3 then presents the assessment results of the FP model, the FP software, the predicted np-FPs and the statistical hydrophobic properties of FP. Finally, Section 4 discusses the validity of the FP model, the FP motif and the evolutionary relationship of FP.

2. Material and methods

2.1. Datasets
* Corresponding author. E-mail address: [email protected] (J. Han).
http://dx.doi.org/10.1016/j.compbiolchem.2016.02.013
1476-9271/© 2016 Elsevier Ltd. All rights reserved.
All the retroviral protein sequences involved in this work were collected from NCBI database (Wheeler et al., 2006) and divided
S. Wu et al. / Computational Biology and Chemistry 61 (2016) 245–250
Table 1 Comparison results between FP model and three classical database search tools.

Method      np-FPs with correct positions   np-FPs with wrong positions   np-FPs undetected   Percentage (a)
PSI-BLAST   16                              50                            51                  13.68%
CS-BLAST    30                              57                            30                  25.64%
HMMER3      39                              47                            31                  33.33%
FP model    111                             3                             3                   94.87%

(a) The percentage of np-FPs with correct positions.
into two datasets. One benchmark dataset contains env sequences with FP annotations and is used to establish the FP prediction model; the other dataset includes env sequences without FP annotations and is used for predicting potential np-FPs.

In the NCBI database, there are 124 env sequences with FP annotations related to HERV, HIV, SIV, HTLV and MLV. After checking these data for reliability, 117 sequences (Table 2) that meet the following criteria were included in the benchmark dataset: the env sequence must be non-repetitive, and the FP annotation must be experimentally validated and non-suspicious. For each sequence in the benchmark dataset, the FP or non-FP domain was considered as a positive or negative sample, respectively, to train the prediction model.

Besides the benchmark dataset, a large number of protein sequences were also downloaded from NCBI. They are env sequences without FP annotations (Table 2) relevant to the five retroviruses. These sequences were prepared for predicting more np-FPs with the proposed FP model.

2.2. Protein sample representation

Two straightforward sequential samples with mathematical expressions were formulated in this work. They reflect the intrinsic correlation between the predicted FP and the inquired env sequence. One is the observation sequence O, which was expressed as

\[ O = (o_1, \ldots, o_T), \quad o_t \in \{1, \ldots, M\}, \quad t \in \{1, \ldots, T\}, \tag{1} \]

where o_t is the t-th amino acid residue of protein O, M is the number of native amino acid types, and T is the length of the inquired sequence. The other is the state sequence Q, which was given by

\[ Q = (q_1, \ldots, q_T), \quad q_t \in \{1, \ldots, N\}, \quad t \in \{1, \ldots, T\}, \tag{2} \]

in which q_t is the state of the t-th residue, indicating FP (q_t = 2) or non-FP (q_t = 1), and N is the number of states.
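As a concrete illustration of Eqs. (1) and (2), the short Python sketch below encodes a toy sequence as an observation sequence O and a state sequence Q. The alphabetical ordering of the 20 standard amino acids, the helper names and the toy sequence with its FP coordinates are assumptions made for illustration only; they are not taken from the FP software.

```python
# Illustrative encoding of a protein into O (Eq. 1) and Q (Eq. 2).
# Assumption: M = 20 standard amino acids, indexed alphabetically by one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: k for k, aa in enumerate(AMINO_ACIDS, start=1)}

NON_FP, FP = 1, 2  # state labels used in the text (N = 2)


def encode_observations(sequence):
    """Map each residue to an integer in {1, ..., M}; raises KeyError on other letters."""
    return [AA_INDEX[aa] for aa in sequence.upper()]


def encode_states(length, fp_start, fp_end):
    """Label residues fp_start..fp_end (1-based, inclusive) as FP and the rest as non-FP."""
    return [FP if fp_start <= t <= fp_end else NON_FP for t in range(1, length + 1)]


toy_sequence = "MKAVGIGALF"  # hypothetical toy sequence, not a real env protein
O = encode_observations(toy_sequence)
Q = encode_states(len(toy_sequence), fp_start=4, fp_end=8)
print(O)
print(Q)
```

Both lists have the same length T, so position t of O and position t of Q describe the same residue.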
2.3. FP prediction model

The FP model predicts the np-FP domain in two phases. First, it adopts the HMM method (Duda et al., 2001; Bonneville and Jin, 2013) to determine the existence and rough location of the np-FP. Subsequently, it performs similarity comparison to obtain a more precise np-FP. The prediction algorithm (Fig. 1) is described in detail as follows.

2.3.1. HMM training

Three matrices, A, B and Π, were defined and estimated to represent the HMM model. A stands for the transition probabilities between the FP and non-FP states, B denotes the emission probability of a residue under a given state, and Π reflects the state distribution of the initial residue. The elements of these matrices were computed by Maximum Likelihood Estimate (Pfanzagl, 1994). According to the FP and non-FP annotations in the sequences of the benchmark dataset, the elements of A and B were respectively given by:

\[
\begin{cases}
a_{ij} = P(q_{t+1} = j \mid q_t = i) = \dfrac{c(q_{t+1} = j,\; q_t = i)}{c(q_t = i)} \\[2ex]
b_{jk} = P(o_t = k \mid q_t = j) = \dfrac{c(o_t = k,\; q_t = j)}{c(q_t = j)}
\end{cases}
\qquad i, j \in \{1, \ldots, N\},\; k \in \{1, \ldots, M\}, \tag{3}
\]

in which c(x) is the number of occurrences of event x. Because the initial amino acid lies upstream of the FP domain in the env protein, it should be a residue in the non-FP state. Thus, the elements of Π were assumed to be:

\[
\pi_i =
\begin{cases}
1 & \text{if } i = 1 \text{ and } t = 0 \\
0 & \text{if } i = 2 \text{ and } t = 0
\end{cases}, \tag{4}
\]

where t = 0 represents the moment before observation and i = 1 or i = 2 denotes the non-FP or FP state, respectively.

2.3.2. HMM decoding

The Viterbi algorithm (Viterbi, 1967; David Forney, 2005) was applied to decode the most likely sequence of hidden states with the trained HMM model λ. The state sequence Q was used to
Table 2 The datasets and results for FP modeling and prediction.

          Train/test        Scan
Groups    Env number (a)    Env number (b)    np-FPs (c)
HERV      19                333               39
HIV       60                168,049           43,139
SIV       14                18,381            9,908
HTLV      8                 1,048             794
MLV       16                107               66
Total     117               187,918           53,946

Result (d) (train/test, all groups):
10-fold CV: Acc = 91.97%, Se = 97.16%, Sp = 99.99%
LOOCV: Acc = 92.31%, Se = 97.19%, Sp = 99.99%

(a) The env sequences with FP annotations.
(b) The env sequences without FP annotations.
(c) The new putative FPs predicted by the model.
(d) The performance tested by two cross-validation methods.
determine the existence of np-FP in the inquired env protein. It was computed by:

\[ Q = \arg\max_{Q} P(Q \mid O, \lambda) = \arg\max_{Q} P(Q, O \mid \lambda). \tag{5} \]

If the sequence was estimated to contain no np-FP domain, the model moved on to the next inquired sequence. Otherwise, the rough np-FP location obtained was used to locate the np-FP precisely by similarity comparison.

2.3.3. Sequence extension and np-FP candidates

The rough np-FP was extended to a longer sequence in which potential np-FP candidates were searched. The extended sequence is centered on the rough np-FP and extends 22 aa (amino acids) on each side. The extension length was determined by cross-validation and a test on one species to ensure that the real FP lies within this enlarged sequence. The np-FP candidates s' were defined as:

\[
s'(i) = \arg\max_{s_2(p \,:\, p + \mathrm{len}(s_3(i)) - 1)} \mathrm{match}\big(s_2(p : p + \mathrm{len}(s_3(i)) - 1),\; s_3(i)\big),
\qquad p \in \{1, \ldots, \mathrm{len}(s_2) - \mathrm{len}(s_3(i)) + 1\}, \tag{6}
\]

where s_2 is the extended sequence, s_3(i) is one of the annotated training FPs, len(x) denotes the length of x and match(x, y) represents the number of exact amino acid matches between x and y.

2.3.4. Similarities calculation and prediction

Three scores were put forward to conduct the similarity comparison. They were defined as:

\[
\begin{cases}
\mathrm{Score1}(i) = \mathrm{match}(s'(i),\, s_3(i)) \\
\mathrm{Score2}(i) = \mathrm{alignmatch}(s'(i),\, s_3(i)) \\
\mathrm{Score3}(i) = |\mathrm{begin}(s'(i)) - \mathrm{begin}(s_1)| + |\mathrm{end}(s'(i)) - \mathrm{end}(s_1)|
\end{cases} \tag{7}
\]

Here, begin(x) or end(x) denotes the beginning or ending position of x in the inquired sequence, s_1 is the rough np-FP and alignmatch(x, y) represents the number of exact amino acid matches between x and y after Smith-Waterman local alignment (Smith and Waterman, 1981). These three scores reveal the similarity or variance between np-FP candidates and training FPs or the rough np-FP. On the basis of all these scores, the np-FP candidate corresponding to the maximum of the first two scores together with the minimum of the last score was recognized as the predicted np-FP.

Fig. 1. Flow chart of FP prediction model.

2.4. Performance assessment

The FP prediction model was tested by 10-fold cross-validation (10-fold CV) and leave-one-out cross-validation (LOOCV) for an accurate and comprehensive evaluation. The quantitative assessment results were measured by three indexes. They were given by:

\[
\begin{cases}
Se = \dfrac{Tp}{Tp + Fn} \\[2ex]
Sp = \dfrac{Tn}{Tn + Fp} \\[2ex]
Acc = \dfrac{True}{True + False}
\end{cases} \tag{8}
\]

where Tp or Fn is the number of amino acids with FP state that were predicted to be FP or non-FP, Tn or Fp is the number of amino acids with non-FP state that were predicted to be non-FP or FP, True denotes the number of correctly predicted np-FPs and False represents the number of undetected and wrongly predicted np-FPs.
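The three indexes of Eq. (8) can be sketched in Python as below. Se and Sp are counted per residue from true and predicted state sequences, while Acc is counted per peptide. The function names and the toy state sequences are illustrative assumptions; the peptide counts in the last line are those of the FP model row in Table 1.

```python
def residue_level_se_sp(true_states, predicted_states, fp_state=2):
    """Se and Sp of Eq. (8), counted over residues (state 2 = FP, state 1 = non-FP).

    Assumes at least one FP residue and one non-FP residue in true_states.
    """
    tp = fn = tn = fp = 0
    for truth, pred in zip(true_states, predicted_states):
        if truth == fp_state and pred == fp_state:
            tp += 1
        elif truth == fp_state:
            fn += 1
        elif pred == fp_state:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)


def peptide_level_acc(n_correct, n_undetected_or_wrong):
    """Acc of Eq. (8): share of correctly predicted np-FPs among all tested sequences."""
    return n_correct / (n_correct + n_undetected_or_wrong)


# Toy residue-level example (hypothetical labels).
se, sp = residue_level_se_sp([1, 1, 2, 2, 2, 1], [1, 2, 2, 2, 1, 1])

# Peptide-level example with the FP model row of Table 1: 111 correct, 3 wrong, 3 undetected.
acc = peptide_level_acc(111, 3 + 3)
print(f"Se = {se:.2%}, Sp = {sp:.2%}, Acc = {acc:.2%}")
```

Keeping the residue-level and peptide-level counts in separate functions mirrors the two different units used in Eq. (8).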
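For the HMM phase of Sections 2.3.1 and 2.3.2, a minimal Python sketch of counting-based training (Eqs. 3 and 4) and Viterbi decoding (Eq. 5) is given below. It is not the code of the FP software. Several choices are assumptions: rows are normalised by their own totals, π is applied directly to the first residue, zero probabilities are handled as log-probabilities of minus infinity, and the two-symbol toy data are purely hypothetical.

```python
import math

N_STATES = 2  # state 1 = non-FP, state 2 = FP, as in Eq. (2)


def train_hmm(obs_seqs, state_seqs, n_symbols):
    """Estimate A and B by counting annotated events (Eq. 3) and fix pi as in Eq. (4).

    Assumes every state occurs at least once with a successor in the training data.
    """
    trans = [[0.0] * N_STATES for _ in range(N_STATES)]
    emit = [[0.0] * n_symbols for _ in range(N_STATES)]
    for obs, states in zip(obs_seqs, state_seqs):
        for t, (o, q) in enumerate(zip(obs, states)):
            emit[q - 1][o - 1] += 1
            if t + 1 < len(states):
                trans[q - 1][states[t + 1] - 1] += 1
    # Assumption: each row is normalised by its own total so that it sums to 1.
    A = [[c / sum(row) for c in row] for row in trans]
    B = [[c / sum(row) for c in row] for row in emit]
    pi = [1.0, 0.0]  # the first residue is assumed to be non-FP
    return A, B, pi


def _log(p):
    """Logarithm that maps probability 0 to minus infinity instead of raising."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(obs, A, B, pi):
    """Most likely state path (1-based labels) for a 1-based observation sequence."""
    n = len(pi)
    delta = [_log(pi[j]) + _log(B[j][obs[0] - 1]) for j in range(n)]
    back = []
    for o in obs[1:]:
        prev, delta, ptr = delta, [], []
        for j in range(n):
            best_i = max(range(n), key=lambda i: prev[i] + _log(A[i][j]))
            ptr.append(best_i)
            delta.append(prev[best_i] + _log(A[best_i][j]) + _log(B[j][o - 1]))
        back.append(ptr)
    state = max(range(n), key=lambda j: delta[j])
    path = [state]
    for ptr in reversed(back):  # follow the back-pointers from the last residue
        state = ptr[state]
        path.append(state)
    path.reverse()
    return [s + 1 for s in path]


def rough_fp_span(path):
    """Return 1-based (start, end) of residues decoded as FP, or None if there are none."""
    hits = [t for t, q in enumerate(path, start=1) if q == 2]
    return (hits[0], hits[-1]) if hits else None


# Hypothetical toy data with a two-symbol alphabet.
A, B, pi = train_hmm([[1, 1, 2, 2, 1]], [[1, 1, 2, 2, 1]], n_symbols=2)
decoded = viterbi([1, 2, 2, 1, 1], A, B, pi)
print(decoded, rough_fp_span(decoded))
```

Working in log space avoids numerical underflow on sequences as long as env proteins. A sequence for which `rough_fp_span` returns `None` corresponds to the case in which no np-FP domain is estimated.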
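The similarity-comparison phase of Sections 2.3.3 and 2.3.4 can be sketched in the same spirit. The Python example below extends a rough np-FP, finds the best ungapped window for each training FP (Eq. 6) and computes Score1 and Score3 of Eq. (7). Score2, which needs a Smith-Waterman alignment, is omitted here. The text does not spell out how the three scores are combined, so the rule used below (highest Score1, ties broken by lowest Score3) is only one assumed reading; all names, the toy sequence and the reduced flank are hypothetical.

```python
def match(x, y):
    """Number of positions at which two equal-length strings carry the same residue."""
    return sum(a == b for a, b in zip(x, y))


def extend(seq_len, start, end, flank=22):
    """Extend a 1-based inclusive region by `flank` residues per side, clipped to the sequence."""
    return max(1, start - flank), min(seq_len, end + flank)


def best_window(s2, s3):
    """Eq. (6): 1-based offset p and window of s2 (length len(s3)) with the most matches."""
    width = len(s3)
    if width > len(s2):
        return None
    p = max(range(len(s2) - width + 1), key=lambda k: match(s2[k:k + width], s3))
    return p + 1, s2[p:p + width]


def refine(query, rough_start, rough_end, training_fps, flank=22):
    """Return (score1, score3, begin, end, candidate) for the selected np-FP candidate."""
    lo, hi = extend(len(query), rough_start, rough_end, flank)
    s2 = query[lo - 1:hi]
    candidates = []
    for s3 in training_fps:
        hit = best_window(s2, s3)
        if hit is None:
            continue
        p, window = hit
        begin = lo + p - 1                # position in the inquired sequence
        end = begin + len(s3) - 1
        score1 = match(window, s3)
        score3 = abs(begin - rough_start) + abs(end - rough_end)
        candidates.append((score1, score3, begin, end, window))
    if not candidates:
        return None
    # Assumed selection rule: maximise Score1, then minimise Score3.
    return max(candidates, key=lambda c: (c[0], -c[1]))


query = "QQQQQAVGIGALFQQQQQ"           # hypothetical toy sequence
result = refine(query, 7, 12, ["AVGIGALF", "GIGAL"], flank=3)
print(result)
```

Because Score1 is a raw match count, longer training FPs can reach higher values than shorter ones; a fuller implementation would also need Score2 to follow the description in Section 2.3.4.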
Fig. 2. Hydrophobicity analysis of FP. The horizontal line represents the boundary between hydrophobicity and hydrophilicity, each circle displays the GRAVY value of an annotated FP (A) or a predicted np-FP (B), and the value near the arrow shows the percentage of hydrophobic FPs or np-FPs.
Table 3 The hydrophobicity of FP analyzed with GRAVY.

Groups            GRAVY > 0 (a)   GRAVY < 0 (b)   Total     Percentage (>0) (c)
Training FPs      117             0               117       100%
All np-FPs        52,120          1,826           53,946    96.62%
np-FPs in HERV    39              0               39        100%
np-FPs in HIV     41,323          1,816           43,139    95.79%
np-FPs in SIV     9,898           10              9,908     99.90%
np-FPs in HTLV    794             0               794       100%
np-FPs in MLV     66              0               66        100%

(a) The number of FPs or np-FPs with positive value of GRAVY.
(b) The number of FPs or np-FPs with negative value of GRAVY.
(c) The percentage of hydrophobic FP or np-FP.
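The GRAVY values tallied in Table 3 are plain averages of Kyte-Doolittle hydropathy values over the residues of a peptide. A minimal Python sketch of that computation follows; the dictionary holds the published Kyte-Doolittle scale, whereas the function name and the two toy peptides are illustrative assumptions rather than sequences from the datasets.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(peptide):
    """Grand average of hydropathicity: mean hydropathy value over all residues."""
    return sum(KYTE_DOOLITTLE[aa] for aa in peptide.upper()) / len(peptide)


def is_hydrophobic(peptide):
    """Positive GRAVY is read as hydrophobic, negative GRAVY as hydrophilic."""
    return gravy(peptide) > 0


for toy in ("AVGIG", "KDE"):  # hypothetical toy peptides
    print(toy, round(gravy(toy), 2), is_hydrophobic(toy))
```

Counting how many peptides in a group give `is_hydrophobic(...) == True` yields the GRAVY > 0 column of Table 3 for that group.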
3. Results

3.1. Model evaluation

The FP prediction model was tested with three validation methods and additionally compared with three classical database search tools. The classification accuracies are 91.97%, 92.31% and 94.87% (Table 2) for 10-fold CV, LOOCV and self-consistency validation, respectively. The number of np-FPs correctly predicted by the proposed model is much larger than that of PSI-BLAST (Altschul et al., 1997), CS-BLAST (Biegert and Söding, 2009) and HMMER3 (Finn et al., 2011) (Table 1). These results indicate that the FP model proposed in this work is capable and useful for FP prediction.

3.2. FP prediction software

The FP prediction software has been programmed for the Windows environment to facilitate application of the proposed model. It uses FASTA as the format of input and output files. Each predicted result in the output files clearly indicates the starting and ending positions of the np-FP in the inquired sequence. The FP software is freely available for download and use at https://sourceforge.net/projects/fptool/files/?source=navbar.

3.3. New putative FPs

The FP model scanned three sets for np-FPs. The first set contains 187,918 env sequences without FP annotations, the second includes five sequences removed from the benchmark dataset because of repetition or non-experimental annotation, and the third has two sequences excluded because of suspicious annotation. In the first set, the FP model discovered 53,946 np-FPs (Table 2). In the second set, the five predicted np-FPs are the same as the corresponding FPs annotated in NCBI. In the last set, the two recognized np-FPs are both consistent with the literature (Earp et al., 2005; White et al., 2008b), according to which FP does not contain the furin cleavage site but is located downstream of this site. All the obtained np-FPs are available for download together with the FP software.

3.4. Hydrophobicity of fusion peptide

The hydropathy scale devised by Kyte and Doolittle (Kyte and Doolittle, 1982) was chosen to calculate the grand average of hydropathicity (GRAVY) (Gasteiger et al., 2005) by the following formula:

\[ \mathrm{GRAVY} = \frac{1}{L} \sum_{i=1}^{L} H(x[i]), \tag{9} \]

where L is the length of the FP or np-FP and H(x[i]) is the hydropathy value of the amino acid at position i. Positive GRAVY scores represent hydrophobic proteins, whereas negative scores indicate hydrophilic proteins. Fig. 2 and Table 3 display the hydrophobicity properties of the FPs and np-FPs involved in this work. It can be seen that 100% of FPs and 96.62% of np-FPs were recognized as hydrophobic domains. These results are in line with previous studies (Apellániz et al., 2014; Nieva and Suárez, 2000) showing that the FP domain is hydrophobic.

4. Discussion

4.1. Validity analysis of FP model

The FP model needs to be tested from different aspects to demonstrate its validity in np-FP prediction. First, the classification accuracies of this model, evaluated by widely recognized cross-validation methods (Chou, 2011; Kohavi, 2001), are relatively high. Second, the np-FPs found in the seven sequences removed from the benchmark dataset all agree well with NCBI or the literature. Third, the statistical hydrophobic properties are similar between FPs annotated by experimental approaches and np-FPs annotated by the proposed model. These results all support the validity of the FP model to a certain degree.

4.2. Motif analysis

Weblogo software (Crooks et al., 2004; Schneider and Stephens, 1990) was adopted to generate the FP motifs of the five retroviruses in Fig. 3. The FP of each virus is clearly a conserved domain, but there are also some divergences among FPs from different viruses. For instance, the FP of HERV seems less conserved than the others, which may be caused by the small number of annotated FPs and the large number of subfamilies (Belshaw et al., 2005; Gifford and Tristem, 2003) in HERV. Additionally, the FPs of HIV and SIV are remarkably more similar to each other than to the others, which may be ascribed to the close relationship of these two viruses (Heeney et al., 2006).

Fig. 3. The sequence logos of FPs created by Weblogo. The numbers along the abscissa indicate the positions of residues, and the ordinate represents the frequency of amino acids at each position.

Fig. 4. The phylogenetic tree of FPs in five retroviruses created by MEGA.
4.3. Evolutionary relationship analysis

MEGA software (Tamura et al., 2011) was employed to create the phylogenetic tree in Fig. 4 to discuss the evolutionary relationship of FPs from different retroviruses. The tree is consistent with the FP motifs, the phylogeny of retroviruses (Weiss, 2006) and the evolutionary relationship of the immunosuppressive domain (Lv et al., 2014). In particular, for the highly related HIV and SIV (Heeney et al., 2006), the closest evolutionary relationship of FPs is in accordance with their remarkably similar motifs.

Acknowledgements

We are grateful to our colleagues in the School of Electronic and Information Engineering, Xi’an Jiaotong University for their help during the course of this work. This work was supported by grants from the Ph.D. Program Foundation of the Ministry of Education of China (No. 20110201110010) and the Natural Science Foundation of Shaanxi Province (No. 2012JQ8042).

References

Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z., Miller, W., et al., 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucl. Acids Res. 25 (8), 3389–3402.
Apellániz, B., Huarte, N., Largo, E., Nieva, J.L., 2014. The three lives of viral fusion peptides. Chem. Phys. Lipids 181, 40–55.
Barnard, R.J.O., Elleder, D., Young, J.A.T., 2006. Avian sarcoma and leukosis virus-receptor interactions: from classical genetics to novel insights into virus-cell membrane fusion. Virology 344 (1), 25–29.
Belshaw, R., Katzourakis, A., Pačes, J., Burt, A., Tristem, M., 2005. High copy number in human endogenous retrovirus families is associated with copying mechanisms in addition to reinfection. Mol. Biol. Evol. 22 (4), 814–817.
Biegert, A., Söding, J., 2009. Sequence context-specific profiles for homology searching. Proc. Natl. Acad. Sci. USA 106 (10), 3770–3775.
Bonneville, R., Jin, V.X., 2013. A hidden Markov model to identify combinatorial epigenetic regulation patterns for estrogen receptor α target genes. Bioinformatics 29 (6), 22–28.
Chou, K.C., 2011. Some remarks on protein attribute prediction and pseudo amino acid composition. J. Theor. Biol. 273 (1), 236–247.
Coffin, J.M., 1992. Structure and Classification of Retroviruses. Springer, US.
Crooks, G.E., Hon, G., Chandonia, J.M., Brenner, S.E., 2004. WebLogo: a sequence logo generator. Genome Res. 14 (6), 1188–1190.
Duda, R.O., Hart, P.E., Stork, D.G., 2001. Pattern Classification. Wiley-Interscience, New York.
Earp, L.J., Delos, S.E., Park, H.E., White, J.M., 2005. The many mechanisms of viral membrane fusion proteins. Curr. Top. Microbiol. Immunol. 285 (285), 25–66.
Epand, R.M., 2003. Fusion peptides and the mechanism of viral fusion. Biochim. Biophys. Acta 1614 (1), 116–121.
Finn, R.D., Clements, J., Eddy, S.R., 2011. HMMER web server: interactive sequence similarity searching. Nucl. Acids Res. 39 (8), W29–37.
David Forney Jr., G., 2005. The Viterbi Algorithm: a personal history. Br. J. Ind. Relat. 11 (2), 259–285.
Gasteiger, E., Hoogland, C., Gattiker, A., Duvaud, S., Wilkins, M.R., Appel, R.D., et al., 2005. The Proteomics Protocols Handbook. Humana Press.
Gifford, R., Tristem, M., 2003. The evolution, distribution and diversity of endogenous retroviruses. Virus Genes 26 (3), 291–315.
Heeney, J.L., Dalgleish, A.G., Weiss, R.A., 2006. Origins of HIV and the evolution of resistance to AIDS. Science 313 (5786), 462–466.
Kohavi, R., 2001. A study of cross-validation and bootstrap for accuracy estimation and model selection. Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence.
Kyte, J., Doolittle, R.F., 1982. A simple method for displaying the hydropathic character of a protein. J. Mol. Biol. 157 (1), 105–132.
Lv, H., Han, J., Liu, J., Zheng, J., Zhong, D., Liu, R., 2014. ISD Tool 2.0: a computational model for predicting immunosuppressive domain of retroviruses. J. Theor. Biol. 360 (25), 78–82.
Münch, J., Ständker, L., Adermann, K., Schulz, A., Schindler, M., Chinnadurai, R., et al., 2007. Discovery and optimization of a natural HIV-1 entry inhibitor targeting the gp41 fusion peptide. Cell 2 (2), 263–275.
Nieva, J.L., Suárez, T., 2000. Hydrophobic-at-interface regions in viral fusion protein ectodomains. Biosci. Rep. 20 (6), 519–533.
Pfanzagl, J., 1994. Parametric statistical theory. J. Am. Stat. Assoc. 91 (433).
Rosenberg, N., 2010. Overview of Retrovirology. Springer, New York.
Schneider, T.D., Stephens, R.M., 1990. Sequence logos: a new way to display consensus sequences. Nucl. Acids Res. 18 (20), 6097–6100.
Sieczkarski, S.B., Whittaker, G.R., 2004. Viral entry. Curr. Top. Microbiol. Immunol. 285 (285), 1–23.
Smith, T.F., Waterman, M.S., 1981. Identification of common molecular subsequences. J. Mol. Biol. 147 (1), 195–197.
Tamura, K., Peterson, D., Peterson, N., Stecher, G., Nei, M., Kumar, S., 2011. MEGA5: molecular evolutionary genetics analysis using maximum likelihood, evolutionary distance, and maximum parsimony methods. Mol. Biol. Evol. 28 (10), 2731–2739.
Viterbi, A.J., 1967. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Trans. Inf. Theory 13 (2), 260–269.
Weiss, R.A., 2006. The discovery of endogenous retroviruses. Retrovirology 3 (3), 1–11.
Wheeler, D.L., Church, D.M., Federhen, S., Lash, A.E., Madden, T.L., Pontius, J.U., et al., 2006. Database resources of the National Center for Biotechnology Information. Nucl. Acids Res. 39 (1), 11–16.
White, J.M., Delos, S.E., Brecher, M., Schornberg, K., 2008a. Structures and mechanisms of viral membrane fusion proteins. Crit. Rev. Biochem. Mol. Biol. 43 (1), 189–219.
White, J.M., Delos, S.E., Brecher, M., Schornberg, K., 2008b. Structures and mechanisms of viral membrane fusion proteins: multiple variations on a common theme. Crit. Rev. Biochem. Mol. Biol. 43 (3), 189–219.
Wolf-Georg, F., Yu-Han, T., Matthias, S., Knut, A., Uwe, A., Hanns-Christian, T., et al., 2010. Short-term monotherapy in HIV-infected patients with a virus entry inhibitor against the gp41 fusion peptide. Sci. Transl. Med. 2 (63), 659–664.