Computational Biology and Chemistry 49 (2014) 45–50
Research Article
ISDTool: A computational model for predicting immunosuppressive domain of HERVs

Hongqiang Lv, Jiuqiang Han, Jun Liu*, Jiguang Zheng, Dexing Zhong, Ruiling Liu

School of Electronic and Information Engineering, Xi’an Jiaotong University, Xi’an 710049, PR China
Article info

Article history: Received 16 October 2013; received in revised form 1 February 2014; accepted 4 February 2014; available online 12 February 2014.

Keywords: Immunosuppressive domain prediction; Human endogenous retrovirus; ISDTool
Abstract

Human endogenous retroviruses (HERVs) have been found to act as etiological cofactors in several chronic diseases, including cancer, autoimmunity and neurological dysfunction. The immunosuppressive domain (ISD) is a conserved region of the transmembrane protein (TM) in the envelope gene (env) of retroviruses. In vitro and in vivo evidence has shown that retroviral TM is highly immunosuppressive and that a synthetic peptide (CKS-17) with homology to ISD inhibits immune function. ISD is therefore a potential pathogenic element in HERVs. However, fewer than one hundred ISDs of HERVs have been annotated so far, and general-purpose software for domain prediction does not achieve sufficient accuracy for ISD specifically. In this paper, a computational model is proposed to identify ISD in HERVs based on genome sequences only. It has a classification accuracy of 97.9% using the Jack-knife test. A total of 117 HERV families were scanned with the model, and 1002 new putative ISDs were predicted and annotated in the human chromosomes. The model is also applicable to searching for ISDs in human T-lymphotropic virus (HTLV), simian T-lymphotropic virus (STLV) and murine leukemia virus (MLV) because of the evolutionary relationship between endogenous and exogenous retroviruses. Furthermore, software named ISDTool has been developed to facilitate the application of the model. The datasets and software involved in the paper are all available at https://sourceforge.net/projects/isdtool/files/ISDTool-1.0.

© 2014 Elsevier Ltd. All rights reserved.
1. Introduction

Human endogenous retroviruses (HERVs) are remnants of ancient retroviral infections. Many insertions into the genome took place tens of millions of years ago (Blikstad et al., 2008). It is conservatively estimated that HERVs occupy 8% (Gifford and Tristem, 2003) of the entire human genome. The discovery, classification and nomenclature of HERVs followed an erratic path, which has led to different names for the same sequences (Blomberg et al., 2009). In this study, the nomenclature (Kapitonov and Jurka, 2008; Smit et al., 1996) of RepBase/RepeatMasker (Jurka, 2000; Jurka et al., 2005) is used. Typical full-length HERVs are about 7–11 kb in size and consist mainly of the coding regions for gag, pro, pol and the envelope gene (env), flanked on both the 5′ and 3′ ends by long terminal repeats (LTR). Most HERVs exist in the human genome with an incomplete structure (Jern and Coffin, 2008) and contain multiple stop codons, insertions, deletions and frame shift mutations (de Parseval et al., 2001; Kim, 2001).
∗ Corresponding author at: Room 158, 2nd East Building, 28 West Xianning Road, Xi’an, Shaanxi 710049, China. Tel.: +86 29 82668665 181; fax: +86 29 82668665 181. E-mail address:
[email protected] (J. Liu). http://dx.doi.org/10.1016/j.compbiolchem.2014.02.001 1476-9271/© 2014 Elsevier Ltd. All rights reserved.
HERVs have been found to act as etiological cofactors in several chronic diseases, including cancer, autoimmunity and neurological dysfunction (Kim, 2012). Evidence from recent studies indicates that the envelope proteins of some HERV families are expressed preferentially in human placenta (Kamat et al., 2002; Venables et al., 1995) and in several types of cancer cells and tissues, such as ovarian cancer cells (Wang-Johanning et al., 2007), prostate carcinoma tissues (Wang-Johanning et al., 2003a) and a breast cancer cell line (Wang-Johanning et al., 2003b). It has also been reported that a conserved region in env, named the immunosuppressive domain (ISD), is translated in neoplastic cells (Benit et al., 2001; Lindeskog et al., 1993). ISD is a highly conserved region of the transmembrane protein (TM) in env of retroviruses, with a typical length of 17 amino acids (AA) (Naito et al., 2003). A synthetic peptide (CKS-17) with homology to ISD inhibits immune function in vitro (Cianciolo et al., 1981) and in vivo (Nelson et al., 1985). ISD is closely related to the immunosuppressive properties of retroviral envelope proteins (Benit et al., 2001; Mangeney and Heidmann, 1998) and is probably a pathogenic element in HERVs (de Parseval et al., 2001; Lindeskog et al., 1993). However, fewer than one hundred ISDs in HERVs have been annotated so far, and these annotations cover only a limited number of HERV families, with much redundancy.
H. Lv et al. / Computational Biology and Chemistry 49 (2014) 45–50
In this paper, a computational model is proposed to identify typical ISD in HERVs based on genome sequences only. Among the ten parameters of the divided physicochemical property scores (DPPS) (Tian et al., 2009), hydrophobicity and hydrogen bond, which play important roles in the immune system (Zhou et al., 2007), were selected for analysis. By combining the position characteristic, hydrophobicity and hydrogen bond properties of amino acids with a position-specific scoring matrix (PSSM), a new hybrid feature extraction approach is proposed. A weighted support vector machine (WSVM) is then used to solve the classification problem of unbalanced samples. As a predictive algorithm for ISD, the model is more specific, more accurate and more practical than universal models for domain position prediction such as DOMpro (Cheng et al., 2006) and PPM-Dom (Sun et al., 2013). The model has a high classification accuracy in the Jack-knife test. Using this model, a large number of DNA sequences corresponding to the coding regions and LTR of 117 HERV families were scanned. In addition, the model can be applied to search for ISD in human T-lymphotropic virus (HTLV), simian T-lymphotropic virus (STLV) and murine leukemia virus (MLV) because of the evolutionary relationship between endogenous and exogenous retroviruses (Antony et al., 2004). To facilitate the application of this model, software named ISDTool has been developed. This software consists of two modules, ISDFindera for amino acid sequences and ISDFindern for DNA sequences.

2. Materials and methods

2.1. Datasets

Amino acid sequences of HERVs, HTLV, STLV and MLV with ISD annotations were collected from NCBI (Sayers et al., 2012) at http://www.ncbi.nlm.nih.gov/index.html. The sequences of HERVs and HTLV, which are associated with human beings, were used to train the model. A large number of sequences without ISD annotation were downloaded from the public databases listed below in order to search them for new putative ISDs.
DNA sequences corresponding to the coding regions and non-coding LTR of 117 HERV families were extracted from a RepeatMasker output file named ChromOut.tar.gz (hg19, GRCh37) in UCSC (Meyer et al., 2013), which is available at http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips. Sequences of HTLV-1 were obtained from the Molecular Epidemiology Database (MEDB) (Araujo et al., 2012) at http://htlv1db.bahia.fiocruz.br; they comprise DNA sequences corresponding to the env and LTR genomic regions. Amino acid sequences of env of STLV-1 and MLV were downloaded from the UniProt Knowledgebase (UniProtKB) (Magrane and Consortium, 2011) at http://www.uniprot.org/uniprot. All datasets above are summarized in Table 1.
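The HERV and HTLV-1 scan sets are DNA, whereas an ISD is a 17-AA motif, so the DNA has to be brought into amino acid space before a window can be evaluated. The paper does not state how ISDFindern does this. The following is a minimal sketch under the assumption that all six reading frames are translated with the standard genetic code and then cut into overlapping 17-residue windows; the function names are hypothetical and are not taken from ISDTool.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; unknown codons (e.g. with N) become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Yield (strand, offset, protein) for the six reading frames."""
    dna = dna.upper()
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            yield strand, offset, translate(seq[offset:])


def windows(protein, size=17):
    """Yield (start, window) for every overlapping window of `size` residues."""
    for start in range(len(protein) - size + 1):
        yield start, protein[start:start + size]
```

Each window produced this way would then be passed to the feature extraction and classifier described in the following sections. Stop codons appear as `*` in the translation, which matters here because most HERV copies carry stop codons and frame shifts.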
2.2. Preparation of training samples

Training samples were prepared from the amino acid sequences with ISD annotations. Repetitive and singular sequences were removed from these initial amino acid sequences. The subsequences corresponding to the ISD region in the remaining sequences were extracted and used as positive training samples. The negative training samples consisted of two parts: one was obtained from random sequences in regions lacking ISD, the other from sequences overlapping the ISD. This preparation process was expected to improve the identification accuracy of the model.

2.3. Feature selection

Because ISD is a highly conserved sequence associated with the inhibition of immune and inflammatory functions (Cianciolo and Pizzo, 2012), the sequence position characteristic and several physicochemical properties of amino acids were taken as candidate features. The DPPS descriptor, which covers electronic, steric, hydrophobic and hydrogen bond properties, was chosen to characterize the physicochemical properties of amino acids. To select the physicochemical properties most closely related to immunity, the Fisher values of the ten DPPS parameters were calculated on the training sample sequences. Of these parameters, V8 and V9, which have the larger Fisher values, were chosen. This is consistent with the knowledge that hydrophobicity and hydrogen bonding play important roles in the immune system (Zhou et al., 2007).

2.4. Construction of feature space

By combining the position characteristic of sequences and the hydrophobicity and hydrogen bond properties of amino acids with PSSM, a new hybrid approach to feature space construction is proposed. Standard PSSM is an effective way of encoding fixed-size motifs into descriptors that can be used to search sequences (Maudling and Attwood, 2004; Naughton et al., 2006), but it does not involve physicochemical properties. In this paper, the motif has a size of 17 AA, the length of a typical ISD in envelope proteins. The twenty standard amino acids are considered. The V8 and V9 physicochemical parameters in DPPS are clustered according to value similarity to meet the requirements of PSSM. Through PSSM, the position characteristic, hydrophobicity and hydrogen bond properties of a motif are mapped to a point in a three-dimensional feature space.

2.5. WSVM classifier

The support vector machine (SVM) is a supervised machine learning algorithm based on statistical learning theory (Vapnik, 1999). The basic idea of SVM is to map the original data into a high-dimensional feature space through a nonlinear mapping function and then construct a hyperplane as a discriminative surface between the positive and negative data (Cui et al., 2013). In this paper, the feature data are normalized and a weighted support vector machine (WSVM) (Chang and Lin, 2011) is employed to solve the classification problem of unbalanced samples (Fig. 1).
Fig. 1. WSVM classifying hyperplane of the positive and negative training samples. Yellow points are positive training samples in the feature space, which indicate ISD sample sequences. Blue points are negative training samples, which represent non-ISD sample sequences. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of the article.)
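Each point in Fig. 1 is a 17-AA window described by three PSSM-derived scores. The sketch below illustrates one way such a three-dimensional feature vector could be computed. It rests on several assumptions that are not from the paper: log-odds scoring against a uniform background with a pseudocount of 1, two-class placeholder groupings in place of the clustered DPPS V8 and V9 values (which are not reproduced here), and synthetic toy motifs rather than real ISD sequences.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
MOTIF_LEN = 17

# Three alphabets, one per feature. The two groupings are placeholders
# (assumptions), NOT the clustered DPPS V8/V9 parameters used in the paper.
IDENTITY = {aa: aa for aa in AMINO_ACIDS}
HYDROPHOBIC = {aa: ("h" if aa in "AVLIMFWC" else "p") for aa in AMINO_ACIDS}
HBOND = {aa: ("b" if aa in "STNQYHKRDEW" else "n") for aa in AMINO_ACIDS}

# Synthetic toy motifs for illustration only.
TOY_MOTIFS = [
    "ACDEFGHIKLMNPQRST",
    "ACDEFGHIKLMNPQRSV",
    "ACDEYGHIKLMNPQRST",
]


def build_pssm(motifs, mapping, pseudocount=1.0):
    """Per-position log2-odds of each symbol class against a uniform background."""
    symbols = sorted(set(mapping.values()))
    background = 1.0 / len(symbols)
    total = len(motifs) + pseudocount * len(symbols)
    pssm = []
    for pos in range(MOTIF_LEN):
        counts = Counter(mapping[m[pos]] for m in motifs)
        pssm.append({
            s: math.log2(((counts[s] + pseudocount) / total) / background)
            for s in symbols
        })
    return pssm


def score(window, pssm, mapping):
    """Sum the column scores of a 17-AA window under one PSSM."""
    return sum(col[mapping[aa]] for col, aa in zip(pssm, window))


def feature_vector(window, models):
    """Map a window to (position, hydrophobicity, hydrogen-bond) scores."""
    return tuple(score(window, pssm, mapping) for pssm, mapping in models)


MODELS = [(build_pssm(TOY_MOTIFS, m), m) for m in (IDENTITY, HYDROPHOBIC, HBOND)]
```

A window that resembles the training motifs scores high on all three axes, while an unrelated window scores low, which is what allows a separating hyperplane to be drawn in this space.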
Table 1. Datasets and prediction results involved in this paper.

Group  Source     Filter                          Number   Sequence type  With ISD annotation  Use         Result^a
HERVs  NCBI                                       94       Amino acids    y                    Train/test  100%
HERVs  UCSC       -int, ∼LTR, ∼LOR; LTR/ERVL^b    94 671   DNA            n                    Scan        1002
HERVs  UCSC       ∼-int, LTR; LTR/ERVL            148 402  DNA            n                    Scan        0
HTLV   NCBI                                       635      Amino acids    y                    Train/test  100%
HTLV   MEDB       HTLV-1; env^c                   737      DNA            n                    Scan        490
HTLV   MEDB       HTLV-1; LTR                     869      DNA            n                    Scan        4
STLV   NCBI                                       183      Amino acids    y                    Test        100%
STLV   UniProtKB  STLV-1; env                     157      Amino acids    n                    Scan        96
MLV    NCBI                                       111      Amino acids    y                    Test        100%
MLV    UniProtKB  MLV; env                        97       Amino acids    n                    Scan        60

a Percentages are the identification rate of ISD; plain numbers are the number of new putative ISDs predicted using the proposed model.
b The name of a matching repeat must contain ‘-int’ but not ‘LTR’ or ‘LOR’, and the class of the matching repeat is ‘LTR/ERVL’.
c A sequence corresponding to env of HTLV-1 is selected in MEDB.
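The repeat filters in Table 1 can be read as simple predicates over the repeat name and class columns of a RepeatMasker `.out` file. The sketch below is an illustration rather than the authors' extraction script: it assumes the usual whitespace-separated `.out` layout, in which the tenth and eleventh fields hold the matching repeat name and its class/family, and the function names and example lines are hypothetical.

```python
def parse_out_line(line):
    """Parse one RepeatMasker .out data line; return None for headers or blanks."""
    fields = line.split()
    if len(fields) < 15 or not fields[0].isdigit():
        return None
    return {
        "chrom": fields[4],
        "start": int(fields[5]),
        "end": int(fields[6]),
        "strand": fields[8],       # '+' or 'C' (complement)
        "name": fields[9],         # matching repeat
        "class": fields[10],       # repeat class/family
    }


def is_internal_coding(rec):
    """Table 1 filter '-int, ~LTR, ~LOR; LTR/ERVL' (footnote b)."""
    name = rec["name"]
    return (
        rec["class"] == "LTR/ERVL"
        and "-int" in name
        and "LTR" not in name
        and "LOR" not in name
    )


def is_ltr(rec):
    """Table 1 filter '~-int, LTR; LTR/ERVL'."""
    name = rec["name"]
    return rec["class"] == "LTR/ERVL" and "-int" not in name and "LTR" in name


# Made-up example lines in .out layout (values are illustrative only).
EXAMPLE_INTERNAL = "2000 10.0 1.0 1.0 chr1 1000 3000 (1000) + HERVL-int LTR/ERVL 1 2000 (0) 7"
EXAMPLE_LTR = "900 12.0 2.0 0.5 chr2 500 900 (5000) C LTR16A LTR/ERVL (10) 400 1 8"
```

Records passing `is_internal_coding` would correspond to the coding-region scan set and records passing `is_ltr` to the LTR scan set; the genomic coordinates can then be used to cut the DNA out of the hg19 assembly.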
Fig. 2. Exact locations of 1002 new putative ISDs of HERVs in the human chromosomes. The red highlight lines represent the exact positions of new putative ISDs of HERVs in the human chromosomes. ‘1–86’ indicates that 86 new putative ISDs were predicted in human chromosome 1. Note: one putative ISD, with the annotation ‘4387196 chrUn’, is not shown in the figure. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of the article.)
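The chromosome labels in Fig. 2 are per-chromosome tallies of predicted ISDs, and dividing each tally by the chromosome length gives a per-base-pair density. A minimal sketch of both calculations follows; the hit tuples and chromosome lengths are toy values chosen for illustration, not data from the paper, and the function names are hypothetical.

```python
from collections import Counter


def count_per_chromosome(hits):
    """Tally hits given as (chromosome, start, end) tuples."""
    return Counter(chrom for chrom, _start, _end in hits)


def density_per_bp(counts, chrom_lengths):
    """Divide each tally by the chromosome length in base pairs."""
    return {
        chrom: counts.get(chrom, 0) / length
        for chrom, length in chrom_lengths.items()
    }


def figure_labels(counts):
    """Build 'chromosome-count' labels in the style used in Fig. 2."""
    return [f"{chrom}-{n}" for chrom, n in sorted(counts.items())]


# Toy data: three 51-bp hits (17 codons each) on two made-up chromosomes.
TOY_HITS = [("1", 100, 151), ("1", 900, 951), ("Y", 50, 101)]
TOY_LENGTHS = {"1": 1000, "Y": 200}
```

With the toy data, chromosome "Y" has fewer hits than chromosome "1" but a higher density, which is why a count view and a per-bp view of the same predictions can rank chromosomes differently.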
2.6. Performance assessment

The model has a classification accuracy of 97.9% using the Jack-knife test. The test is a cross-validation examination (Yuan, 1999) in which one test sequence is separated from the entire sample set, so that the remaining samples constitute the training set. The separated sequence is then predicted and the prediction is compared with its true value. The operation is repeated until each sample sequence has been held out once (Manel et al., 1999). The correlation coefficient used with the Jack-knife test is given by Matthews (1975):

$$C(s) = \frac{p(s)\,n(s) - u(s)\,o(s)}{\sqrt{\bigl(p(s)+u(s)\bigr)\bigl(p(s)+o(s)\bigr)\bigl(n(s)+u(s)\bigr)\bigl(n(s)+o(s)\bigr)}} \tag{1}$$

where p(s) is the number of true positives, n(s) is the number of true negatives, u(s) is the number of under-estimated sequences (false negatives) and o(s) is the number of over-estimated sequences (false positives).
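The leave-one-out procedure and the correlation coefficient of Eq. (1) can be sketched as follows. This is an illustration only: the nearest-centroid classifier is a simple stand-in used to keep the example self-contained and is not the WSVM of the paper, and all function names are hypothetical.

```python
import math


def matthews_cc(p, n, u, o):
    """Eq. (1): p, n = true positives/negatives; u, o = under-/over-estimated."""
    denom = math.sqrt((p + u) * (p + o) * (n + u) * (n + o))
    return (p * n - u * o) / denom if denom else 0.0


def fit_centroids(xs, ys):
    """Stand-in classifier: one mean vector per class label."""
    centroids = {}
    for label in set(ys):
        rows = [x for x, y in zip(xs, ys) if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def predict_centroid(centroids, x):
    """Return the label whose centroid is closest to x."""
    return min(
        centroids,
        key=lambda lab: sum((a - b) ** 2 for a, b in zip(x, centroids[lab])),
    )


def jackknife(samples, labels, fit, predict):
    """Hold out each sample once; return (accuracy, Matthews coefficient)."""
    p = n = u = o = 0
    for i in range(len(samples)):
        model = fit(samples[:i] + samples[i + 1:], labels[:i] + labels[i + 1:])
        guess, truth = predict(model, samples[i]), labels[i]
        if guess == 1 and truth == 1:
            p += 1
        elif guess == 0 and truth == 0:
            n += 1
        elif guess == 0 and truth == 1:
            u += 1          # missed positive (under-estimated)
        else:
            o += 1          # false alarm (over-estimated)
    return (p + n) / len(samples), matthews_cc(p, n, u, o)
```

Passing `fit` and `predict` as arguments keeps the loop independent of the classifier, so the same routine could wrap any trained model, including an SVM with per-class weights.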
3. Results

3.1. New putative ISDs of HERVs

The proposed model was used to search for new putative ISDs of 117 HERV families in sequences without ISD annotations. A total of 94 671 DNA sequences corresponding to coding regions and 148 402 LTR sequences of HERVs from RepeatMasker were scanned. In total, 1002 new putative ISDs (Supplementary part 1) were predicted in the coding region sequences, but none in LTR. This agrees with expectation: because ISD is a conserved region of the transmembrane protein encoded by env in the coding regions, it does not exist in non-coding LTR. The exact locations of these new putative ISDs of HERVs in the human chromosomes were plotted with the CIRCOS software (Krzywinski et al., 2009) (Fig. 2).

3.2. New putative ISDs of HTLV-1

DNA sequences of 737 env and 869 LTR genomic regions of HTLV-1 from MEDB were scanned; 490 new putative ISDs (Supplementary part 2) were predicted in the env sequences, but only 4 putative ISDs in LTR.

3.3. New putative ISDs of STLV-1 and MLV

A total of 157 and 97 amino acid sequences corresponding to env of STLV-1 and MLV, respectively, from UniProtKB were scanned, and 96 and 60 new putative ISDs (Supplementary parts 3 and 4) were predicted.

3.4. ISDTool software

Software named ISDTool for the Windows environment has been developed to facilitate the application of this model. The tool consists of two modules, ISDFindera for amino acid sequences and ISDFindern for DNA sequences. Both accept sequences or a file in FASTA format as input. The output can be given as sequences or as a file, and an ISD annotation is appended to the original annotation line of each input sequence. The annotation indicates the starting and ending positions of ISD in the sequence. The software automatically filters non-alphabetic characters, so that sequences in FASTA format copied from websites such as NCBI can be read and processed directly. ISDTool accepts direct sequence input of up to 32 767 KB, and input files no larger than 200 MB are recommended. For details, refer to the manual that comes with the software.

Fig. 3. Distribution of ISDs in HERVs. (A) The number of ISDs in HERVs of the 24 human chromosomes. (B) The number of ISDs in HERVs per bp of the 24 human chromosomes. (C) The number of ISDs in 117 HERV families. The abscissa denotes HERV families, arranged in alphabetical order of family names. (D) Percentage of the number of ISDs in the seven families with the most ISDs. Note: In (A) and (B), the putative ISD with the ‘4387196 chrUn’ annotation is not included.

Fig. 4. ISD motifs of HERVs, HTLV-1, STLV-1 and MLV.

4. Discussion

4.1. Model validation and application

In the Jack-knife cross-validation test, the model has a classification accuracy of 97.9%. In a self-test on the training sample sequences, the model identifies the ISD of the sample sequences with an accuracy of 100%. In addition, a model constructed only from the ISD samples of HERVs identifies the ISD samples of HTLV, STLV and MLV with an accuracy of 100%. The ISDs of HERVs and of the three retroviruses therefore probably have a very high similarity in the three selected features, which is in line with the evolutionary relationship between them and provides evidence for the validity of the model.

ISD is closely related to the immunosuppressive properties of retroviral envelope proteins and is probably a pathogenic element in HERVs. The model is proposed to identify ISD in HERVs based on genome sequences only. It is hoped that this work will be helpful for those who are interested in the ISD of retroviruses, especially of HERVs, and that it will provide valuable information for exploring the relationship between HERVs and chronic diseases.

4.2. Analysis of the 4 new putative ISDs in LTR of HTLV-1

ISD is a conserved region in the transmembrane protein of retroviral env and does not exist in non-coding LTR. However, in this paper, 4 new putative ISDs were predicted by the model in LTR sequences of HTLV-1 from MEDB.
The four corresponding LTR sequences, whose accession numbers are Y16495, Y16496, AY342309 and AY342310, were extracted and analyzed. These four sequences are annotated as env of HTLV-1 in NCBI as well as in the European Nucleotide Archive (ENA), and for each of them the three database records have exactly the same DNA sequence (Supplementary part 5). Therefore, the four sequences, annotated as LTR of HTLV-1 in MEDB, may actually be coding env rather than non-coding LTR. This would explain why the 4 new putative ISDs were predicted by the model in the LTR sequences of HTLV-1 from MEDB.

4.3. Distribution of ISDs in HERVs

The distribution of putative ISDs of HERVs in the 24 human chromosomes is given in Fig. 3A and B. Chromosomes Y and 19 have a higher ISD content per base pair (bp) than the other chromosomes. In addition, the number of ISDs differs greatly between HERV families (Fig. 3C), and most ISDs belong to a few families. The seven families with the most ISDs account for more than half of the ISDs found in the 117 HERV families (Fig. 3D). Therefore, the ISD of HERVs probably has a family preference.

4.4. Motif analysis

Motifs of ISD in HERVs and in the other three retroviruses, HTLV-1, STLV-1 and MLV, were detected with MEME version 4.9.1 (Bailey et al., 2009), which is devoted to the discovery and analysis of sequence motifs. The ISD sequence is highly conserved within each retrovirus. HTLV-1 and STLV-1, although they infect different species, have exactly the same ISD motif because of the close evolutionary relationship between them (Fig. 4).

Acknowledgments

We are grateful to our colleagues in the School of Electronic and Information Engineering, Xi’an Jiaotong University for their help during the course of this work, in particular Dr. Shanxin Zhang for critical reading and comments on the manuscript. This work was supported by grants from the National Natural Science Foundation of China (Nos. 61105021 and 61071217).

Appendix A. Supplementary data

Supplementary material related to this article can be found, in the online version, at http://dx.doi.org/10.1016/j.compbiolchem.2014.02.001.
References

Antony, J.M., van Marle, G., Opii, W., Butterfield, D.A., Mallet, F., Yong, V.W., Wallace, J.L., Deacon, R.M., Warren, K., Power, C., 2004. Human endogenous retrovirus glycoprotein-mediated induction of redox reactants causes oligodendrocyte death and demyelination. Nat. Neurosci. 7, 1088–1095.
Araujo, T.H., Souza-Brito, L.I., Libin, P., Deforche, K., Edwards, D., de Albuquerque-Junior, A.E., Vandamme, A.M., Galvao-Castro, B., Alcantara, L.C., 2012. A public HTLV-1 molecular epidemiology database for sequence management and data mining. PLoS ONE 7, e42123.
Bailey, T.L., Boden, M., Buske, F.A., Frith, M., Grant, C.E., Clementi, L., Ren, J., Li, W.W., Noble, W.S., 2009. MEME SUITE: tools for motif discovery and searching. Nucleic Acids Res. 37, W202–W208.
Benit, L., Dessen, P., Heidmann, T., 2001. Identification, phylogeny, and evolution of retroviral elements based on their envelope genes. J. Virol. 75, 11709–11719.
Blikstad, V., Benachenhou, F., Sperber, G.O., Blomberg, J., 2008. Evolution of human endogenous retroviral sequences: a conceptual account. Cell. Mol. Life Sci. 65, 3348–3365.
Blomberg, J., Benachenhou, F., Blikstad, V., Sperber, G., Mayer, J., 2009. Classification and nomenclature of endogenous retroviral sequences (ERVs): problems and recommendations. Gene 448, 115–123.
Chang, C.-C., Lin, C.-J., 2011. LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. 2, 1–27.
Cheng, J., Sweredoski, M.J., Baldi, P., 2006. DOMpro: protein domain prediction using profiles, secondary structure, relative solvent accessibility, and recursive neural networks. Data Min. Knowl. Discov. 13, 1–10.
Cianciolo, G., Hunter, J., Silva, J., Haskill, J.S., Snyderman, R., 1981. Inhibitors of monocyte responses to chemotaxins are present in human cancerous effusions and react with monoclonal antibodies to the P15(E) structural protein of retroviruses. J. Clin. Invest. 68, 831–844.
Cianciolo, G.J., Pizzo, S.V., 2012. Anti-inflammatory and vasoprotective activity of a retroviral-derived peptide, homologous to human endogenous retroviruses: endothelial cell effects. PLoS ONE 7, e52693.
Cui, Y., Han, J., Zhong, D., Liu, R., 2013. A novel computational method for the identification of plant alternative splice sites. Biochem. Biophys. Res. Commun. 431, 221–224.
de Parseval, N., Casella, J.-F., Gressin, L., Heidmann, T., 2001. Characterization of the three HERV-H proviruses with an open envelope reading frame encompassing the immunosuppressive domain and evolutionary history in primates. Virology 279, 558–569.
Gifford, R., Tristem, M., 2003. The evolution, distribution and diversity of endogenous retroviruses. Virus Genes 26, 291–315.
Jern, P., Coffin, J.M., 2008. Effects of retroviruses on host genome function. Annu. Rev. Genet. 42, 709–732.
Jurka, J., 2000. Repbase update: a database and an electronic journal of repetitive elements. Trends Genet. 16, 418–420.
Jurka, J., Kapitonov, V.V., Pavlicek, A., Klonowski, P., Kohany, O., Walichiewicz, J., 2005. Repbase Update, a database of eukaryotic repetitive elements. Cytogenet. Genome Res. 110, 462–467.
Kamat, A., Hinshelwood, M.M., Murry, B.A., Mendelson, C.R., 2002. Mechanisms in tissue-specific regulation of estrogen biosynthesis in humans. Trends Endocrinol. Metab. 13, 122–128.
Kapitonov, V.V., Jurka, J., 2008. A universal classification of eukaryotic transposable elements implemented in Repbase. Nat. Rev. Genet. 9, 411–412, author reply 414.
Kim, H.S., 2001. Sequence and phylogeny of HERV-W pol fragments. AIDS Res. Hum. Retroviruses 17, 1665–1671.
Kim, H.S., 2012. Genomic impact, chromosomal distribution and transcriptional regulation of HERV elements. Mol. Cells 33, 539–544.
Krzywinski, M., Schein, J., Birol, I., Connors, J., Gascoyne, R., Horsman, D., Jones, S.J., Marra, M.A., 2009. Circos: an information aesthetic for comparative genomics. Genome Res. 19, 1639–1645.
Lindeskog, M., Medstrand, P., Blomberg, J., 1993. Sequence variation of human endogenous retrovirus ERV9-related elements in an env region corresponding to an immunosuppressive peptide: transcription in normal and neoplastic cells. J. Virol. 67, 1122–1126.
Magrane, M., Consortium, U., 2011. UniProt Knowledgebase: a hub of integrated protein data. Database (Oxford) 2011, bar009.
Manel, S., Dias, J.-M., Ormerod, S.J., 1999. Comparing discriminant analysis, neural networks and logistic regression for predicting species distributions: a case study with a Himalayan river bird. Ecol. Model. 120, 337–347.
Mangeney, M., Heidmann, T., 1998. Tumor cells expressing a retroviral envelope escape immune rejection in vivo. Proc. Natl. Acad. Sci. U.S.A. 95, 14920–14925.
Matthews, B.W., 1975. Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim. Biophys. Acta 405, 442–451.
Maudling, N., Attwood, T.K., 2004. FAN: fingerprint analysis of nucleotide sequences. Nucleic Acids Res. 32, W620–W623.
Meyer, L.R., Zweig, A.S., Hinrichs, A.S., Karolchik, D., Kuhn, R.M., Wong, M., Sloan, C.A., Rosenbloom, K.R., Roe, G., Rhead, B., et al., 2013. The UCSC Genome Browser database: extensions and updates 2013. Nucleic Acids Res. 41, D64–D69.
Naito, T., Ogasawara, H., Kaneko, H., Hishikawa, T., Sekigawa, I., Hashimoto, H., Maruyama, N., 2003. Immune abnormalities induced by human endogenous retroviral peptides: with reference to the pathogenesis of systemic lupus erythematosus. J. Clin. Immunol. 23, 371–376.
Naughton, B.T., Fratkin, E., Batzoglou, S., Brutlag, D.L., 2006. A graph-based motif detection algorithm models complex nucleotide dependencies in transcription factor binding sites. Nucleic Acids Res. 34, 5730–5739.
Nelson, M., Nelson, D.S., Spradbrow, P.B., Kuchroo, V.K., Jennings, P.A., Cianciolo, G.J., Snyderman, R., 1985. Successful tumour immunotherapy: possible role of antibodies to anti-inflammatory factors produced by neoplasms. Clin. Exp. Immunol. 61, 109–117.
Sayers, E.W., Barrett, T., Benson, D.A., Bolton, E., Bryant, S.H., Canese, K., Chetvernin, V., Church, D.M., Dicuccio, M., Federhen, S., et al., 2012. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 40, D13–D25.
Smit, A., Hubley, R., Green, P., 1996. RepeatMasker Open-3.0.
Sun, J., Jing, R., Wang, Y., Zhu, T., Li, M., Li, Y., 2013. PPM-Dom: a novel method for domain position prediction. Comput. Biol. Chem. 47C, 8–15.
Tian, F., Yang, L., Lv, F., Yang, Q., Zhou, P., 2009. In silico quantitative prediction of peptides binding affinity to human MHC molecule: an intuitive quantitative structure-activity relationship approach. Amino Acids 36, 535–554.
Vapnik, V.N., 1999. An overview of statistical learning theory. IEEE Trans. Neural Netw. 10, 988–999.
Venables, P.J., Brookes, S.M., Griffiths, D., Weiss, R.A., Boyd, M.T., 1995. Abundance of an endogenous retroviral envelope protein in placental trophoblasts suggests a biological function. Virology 211, 589–592.
Wang-Johanning, F., Frost, A.R., Jian, B., Azerou, R., Lu, D.W., Chen, D.T., Johanning, G.L., 2003a. Detecting the expression of human endogenous retrovirus E envelope transcripts in human prostate adenocarcinoma. Cancer 98, 187–197.
Wang-Johanning, F., Frost, A.R., Jian, B., Epp, L., Lu, D.W., Johanning, G.L., 2003b. Quantitation of HERV-K env gene expression and splicing in human breast cancer. Oncogene 22, 1528–1535.
Wang-Johanning, F., Liu, J., Rycaj, K., Huang, M., Tsai, K., Rosen, D.G., Chen, D.T., Lu, D.W., Barnhart, K.F., Johanning, G.L., 2007. Expression of multiple human endogenous retrovirus surface envelope proteins in ovarian cancer. Int. J. Cancer 120, 81–90.
Yuan, Z., 1999. Prediction of protein subcellular locations using Markov chain models. FEBS Lett. 451, 23–26.
Zhou, P., Tian, F., Li, Z., 2007. A structure-based, quantitative structure-activity relationship approach for predicting HLA-A*0201-restricted cytotoxic T lymphocyte epitopes. Chem. Biol. Drug Des. 69, 56–67.