pLoc-mVirus: Predict subcellular localization of multi-location virus proteins via incorporating the optimal GO information into general PseAAC

Accepted Manuscript

Xiang Cheng, Xuan Xiao, Kuo-Chen Chou

PII: S0378-1119(17)30548-6
DOI: 10.1016/j.gene.2017.07.036
Reference: GENE 42060
To appear in: Gene
Received date: 21 April 2017
Revised date: 8 July 2017
Accepted date: 11 July 2017

Please cite this article as: Xiang Cheng, Xuan Xiao, Kuo-Chen Chou, pLoc-mVirus: Predict subcellular localization of multi-location virus proteins via incorporating the optimal GO information into general PseAAC, Gene (2017), doi: 10.1016/j.gene.2017.07.036

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

pLoc-mVirus: predict subcellular localization of multi-location virus proteins via incorporating the optimal GO information into general PseAAC

Xiang Cheng1, Xuan Xiao1,3*, Kuo-Chen Chou2,3*

Authors’ e-mail addresses
Xiang Cheng: [email protected]
Xuan Xiao: [email protected]
Kuo-Chen Chou: [email protected]

1 Computer Department, Jingdezhen Ceramic Institute, Jingdezhen, China
2 Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, 610054, China
3 The Gordon Life Science Institute, Boston, MA 02478, USA

*Corresponding author

Short Title: Predicting Subcellular Localization of Virus Proteins

ABSTRACT

Knowledge of the subcellular locations of proteins is crucially important for an in-depth understanding of their functions in a cell. With the explosive growth of protein sequences generated in the postgenomic age, computational tools that can annotate their subcellular locations in a timely manner from sequence information alone are in high demand. The current study focuses on virus proteins. Although considerable efforts have been made in this regard, the problem is far from solved. Most existing methods can deal with single-location proteins only. In fact, proteins with multiple locations may have special biological functions, and such multiplex proteins are particularly important for both basic research and drug design. Using the multi-label theory, we present a new predictor called “pLoc-mVirus” by incorporating the optimal GO (Gene Ontology) information into the general PseAAC (Pseudo Amino Acid Composition). Rigorous cross-validation on the same stringent benchmark dataset indicated that the proposed pLoc-mVirus predictor is remarkably superior to iLoc-Virus, the state-of-the-art method for predicting virus protein subcellular localization. To maximize convenience for most experimental scientists, a user-friendly web-server for the new predictor has been established at http://www.jci-bioinfo.cn/pLoc-mVirus/, by which users can easily get their desired results without needing to go through the complicated mathematics involved.

Keywords: Multi-label system; GO; PseAAC; ML-GKR; Chou’s metrics

1. INTRODUCTION

One of the fundamental problems in cellular and molecular biology is to understand how a cell works as a basic unit of life. To really understand this, knowledge of proteins in different organelles, or their subcellular localization, is a prerequisite. During the last two decades or so, many computational methods have been developed to address this problem (see, e.g., [1-12] as well as a long list of references cited in a review paper [13]).

But most of the existing computational methods were designed to treat the single-label system, in which each of the constituent proteins has one, and only one, subcellular location. With more experimental data emerging, however, it has become clear that the localization of proteins in a cell is actually a multi-label system, where some proteins may simultaneously occur at two or more different location sites. Such multiplex proteins often bear some exceptional functions [14] and should not be ignored.

About 10 years ago, considerable efforts were made to explore this kind of multiplex protein system [15; 16]. Compared with single-label systems, multi-label systems are much more difficult and complicated to deal with. In particular, it is extremely difficult for a multi-label predictor to yield a decent result for the “absolute true” rate, for the following reason. Suppose a virus protein is labeled with “1”, “2”, and “3”, meaning that it may simultaneously exist in subcellular locations 1, 2, and 3 in the real world. If its predicted result is “1 and 2”, or “1, 2, 3, and 4”, no score at all is added to the absolute true rate. Only when the predicted result is also “1, 2, and 3”, i.e., perfectly identical to the actual labels, is one score added in calculating the absolute true rate. Therefore, it is the harshest metric for measuring the quality of a multi-label predictor [17], which is why many authors did not even mention the term “absolute true rate” when proposing their multi-label predictors. Interestingly, in developing various powerful techniques (see, e.g., [18-25]) for studying a series of important problems in medicinal chemistry, bioinformatics, and systems biology, González-Díaz and his coauthors pointed out the importance of the “absolute true rate”, although they never used their techniques to address the multi-label system for protein subcellular localization.

In this study, we use the multi-label theory [17] to develop a new predictor for identifying the subcellular localization of virus proteins, aimed at improving its absolute true and absolute false rates, the two most important and harshest metrics for a multi-label predictor [17].

2. MATERIALS AND METHODS

2.1. Benchmark Dataset

According to Chou’s 5-step rule [26] for developing a statistical predictor, the first important step is to construct or select a valid benchmark dataset to train and test the model, as done in [27-35]. In the literature, the benchmark dataset usually consists of a training dataset and a testing dataset: the former is for training a proposed model, while the latter is for testing it. But as elucidated in [13], one good-quality benchmark dataset suffices if the model is tested by the jackknife or subsampling (K-fold cross-validation) test, because the outcome thus obtained is actually a combination of many different independent dataset tests.

In this study, the benchmark dataset was taken from [15; 16], for the following reasons. (1) The dataset contains a statistically significant number of virus proteins with both single and multiple locations confirmed by experiments, and none of the proteins included has ≥ 25% pairwise sequence identity to any other in the same subset. Such a cutoff was not imposed on the protein sequences in the “viral capsid” subset, however, because otherwise it would contain too few proteins to be of statistical significance, as originally explained in [15]. (2) It is also the same benchmark dataset used to train and test iLoc-Virus [16], the state-of-the-art predictor in this area, and hence facilitates a comparison under the same conditions.

For readers’ convenience, the benchmark dataset is given in Supporting Information S1. It contains $N_{\mathrm{seq}} = 207$ sequence-different virus proteins classified into 6 subsets according to their subcellular locations. An overall view of these proteins in the 6 subcellular locations is given in Supporting Information S2, from which we can see that, of the 207 different virus proteins, 165 belong to one location, 39 to two locations, 3 to three locations, and none to 4 or more locations.

A breakdown of the $N_{\mathrm{seq}} = 207$ virus proteins according to their occurrences in the 6 different subcellular locations is given in Table 1, where

$$ N_{\mathrm{vir}} = \sum_{k=1}^{N_{\mathrm{seq}}} n^{\mathrm{L}}(k) \tag{1} $$

is the total number of “virtual proteins” [36] or “locative proteins” [37] in the benchmark dataset, and $n^{\mathrm{L}}(k)$ is the number of different labels (or subcellular locations) marked on the k-th sequence-different virus protein. Accordingly, the multiplicity degree MD [38] of the current benchmark dataset is

$$ \mathrm{MD} = \frac{\sum_{k=1}^{N_{\mathrm{seq}}} n^{\mathrm{L}}(k)}{N_{\mathrm{seq}}} = \frac{N_{\mathrm{vir}}}{N_{\mathrm{seq}}} = 1.217 \tag{2} $$

As we can see from Eq.2, MD = 1 means that the system contains no protein with more than one location, while MD > 1 means that some proteins have more than one location. The higher the value of MD, the more protein samples have multiple labels. For instance, MD = 1 for most existing protein subcellular location prediction methods, which do not cover multi-label proteins; it is 1.146 for Euk-mPLoc 2.0 [39] and iLoc-Euk [40], and 1.185 for Hum-mPLoc 2.0 [36] and iLoc-Hum [37].
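As a quick arithmetic check of Eqs. 1 and 2, the following minimal Python sketch recomputes the number of virtual proteins and the multiplicity degree from the location breakdown quoted in Section 2.1 (165, 39, and 3 proteins with one, two, and three locations). The function and variable names are illustrative assumptions, not part of the original work.

```python
# Number of locations per protein -> number of proteins with that many locations,
# as quoted in the text for the benchmark dataset.
location_counts = {1: 165, 2: 39, 3: 3}


def multiplicity_degree(counts):
    """Return (N_seq, N_vir, MD) following Eqs. 1 and 2."""
    n_seq = sum(counts.values())  # sequence-different proteins
    # Each protein contributes one "virtual protein" per location label (Eq. 1).
    n_vir = sum(n_labels * n_proteins for n_labels, n_proteins in counts.items())
    return n_seq, n_vir, n_vir / n_seq  # MD = N_vir / N_seq (Eq. 2)


n_seq, n_vir, md = multiplicity_degree(location_counts)
print(n_seq, n_vir, round(md, 3))  # prints: 207 252 1.217
```

The result agrees with the totals listed in Table 1 (252 virtual proteins, 207 sequences, MD = 1.217).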

To simplify the description later, the benchmark dataset is denoted by 𝕊, which can be further formulated as

$$ \mathbb{S} = \mathbb{S}_1 \cup \mathbb{S}_2 \cup \cdots \cup \mathbb{S}_u \cup \cdots \cup \mathbb{S}_5 \cup \mathbb{S}_6 \tag{3} $$

where $\mathbb{S}_1$ only contains the virus protein samples from the “Viral capsid” location (cf. Table 1), $\mathbb{S}_2$ only contains those from the “Host cell membrane” location, and so forth; $\cup$ denotes the symbol for “union” in set theory.

2.2. Protein Sample Formulation

Now let us consider the 2nd step of the 5-step rule [26]; i.e., how to formulate the biological sequence samples with an effective mathematical expression that can truly reflect their essential correlation with the target concerned. Given a virus protein sequence P, its most straightforward expression is

$$ \mathbf{P} = \mathrm{R}_1 \mathrm{R}_2 \mathrm{R}_3 \mathrm{R}_4 \mathrm{R}_5 \mathrm{R}_6 \mathrm{R}_7 \cdots \mathrm{R}_L \tag{4} $$

where L denotes the protein’s length or the number of its constituent amino acid residues, $\mathrm{R}_1$ is the 1st residue, $\mathrm{R}_2$ the 2nd residue, $\mathrm{R}_3$ the 3rd residue, and so forth. Since all the existing machine-learning algorithms, such as SVM (Support Vector Machine) [41], KNN (K-Nearest Neighbor) [42], and RF (Random Forest) [43], can only handle vectors [44], we have to convert the sequential expression of Eq.4 into a vector. But a vector defined in a discrete model might completely lose all the sequence-order information. To deal with this problem, the PseAAC (Pseudo Amino Acid Composition) was introduced [45]. Subsequently, various modes of PseAAC were proposed to capture different sequence patterns that are essential to the various targets concerned (see, e.g., [46-57] and a long list of references cited in three review papers [58-60] and two open-access papers [61; 62]). According to the concept of general PseAAC [26], any protein sequence can be formulated as a PseAAC vector given by

$$ \mathbf{P} = [\Psi_1 \; \Psi_2 \; \cdots \; \Psi_u \; \cdots \; \Psi_\Omega]^{\mathbf{T}} \tag{5} $$

where T is the transpose operator, while the integer Ω is a parameter; its value, as well as the components $\Psi_u$ $(u = 1, 2, \cdots, \Omega)$, depends on how the desired information is extracted from the amino acid sequence of P, as elaborated below.

Being one type of general PseAAC [26], the GO (Gene Ontology) has been widely used to improve the prediction quality of protein subcellular localization (see, e.g., [16; 63-65]). The advantage of using the GO approach is that proteins mapped into the GO space (instead of Euclidean space or any other simple geometric space) would be better clustered according to their subcellular locations, as elaborated in [66]. For the rationale of using the GO approach to predict protein subcellular localization, and an incisive discussion to justify the GO approach, see Section VI of a comprehensive review paper [17].

However, the existing GO approaches (see, e.g., [16; 63; 64; 67]) have the following shortcomings. (1) Only the digits 0 and 1 (or their simple combination) were used to incorporate the GO information, and hence some important information may be missed. (2) The dimension of the protein vectors, namely Ω of Eq.5, in the previous GO approaches was very high; e.g., it is 1,930 in [68], 3,043 in [38], and 9,567 in [69], and hence may lead to the high-dimension disaster problem [70]. Here, we introduce a novel GO approach, through which we can capture the key information while winnowing out many trivial items, so as to significantly reduce the dimension of the PseAAC vector of Eq.5. The detailed procedures are as follows.

Step 1. Use BLAST to search all the virus proteins in the Swiss-Prot database for those proteins that have high homology (i.e., more than 60% pairwise sequence identity) with the protein P of Eq.4. The proteins thus obtained are collected into a subset, $\mathbb{S}_{\mathbf{P}}^{\mathrm{homo}}$, called the homology set of P. Subsequently, retrieve the GO numbers of the protein in $\mathbb{S}_{\mathbf{P}}^{\mathrm{homo}}$ that has the highest homology with P. If it has no GO number at all, do the same for the 2nd highest homologous protein in $\mathbb{S}_{\mathbf{P}}^{\mathrm{homo}}$; if that one has no GO numbers either, do the same for the 3rd highest homologous one; continue in this way until obtaining a set of GO numbers as given below

$$ \{ \mathrm{GO}_1^{\mathbf{P}} \; \mathrm{GO}_2^{\mathbf{P}} \; \cdots \; \mathrm{GO}_k^{\mathbf{P}} \; \cdots \; \mathrm{GO}_{n_g}^{\mathbf{P}} \} \tag{6} $$

where $\mathrm{GO}_k^{\mathbf{P}}$ $(k = 1, 2, \cdots, n_g)$ is the k-th GO number for the protein in $\mathbb{S}_{\mathbf{P}}^{\mathrm{homo}}$ that has first been found with a set of GO numbers according to the aforementioned order, and $n_g$ is the total number of GO numbers it has. Suppose we find from the training dataset that the total number of proteins having exactly the same GO number as $\mathrm{GO}_k^{\mathbf{P}}$ is N(k), of which the number of proteins in the u-th subset is

$$ n(k, u) \quad (k = 1, 2, \cdots, n_g; \;\; u = 1, 2, \cdots, L_{\mathrm{cell}}) \tag{7} $$

where $L_{\mathrm{cell}} = 6$ is the total number of subcellular locations investigated (see Eq.3 or Table 1).

Step 2. Based on Eq.7, the general PseAAC vector in Eq.5 and its dimension can be uniquely defined as

$$ \Psi_u = \max_{1 \le k \le n_g} \left[ \frac{n(k, u)}{N(k)} \right] \quad (u = 1, 2, \cdots, \Omega = L_{\mathrm{cell}} = 6) \tag{8} $$

where N(k) is the total number of virus proteins in the training dataset that have the same GO number as $\mathrm{GO}_k^{\mathbf{P}}$, and the operator Max means taking the maximum value over the different k. It is through this maximization operation that the most important GO information for this study is captured and many trivial GO numbers are screened out to reduce the vector’s dimension. Listed in Supporting Information S3 are the PseAAC vectors defined by Eq.8 for the 207 sequence-different virus proteins in Supporting Information S1. As we can see there, the dimension of the current PseAAC vectors has been reduced to 6, much lower than those in the previous approaches [38; 68; 69].
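Steps 1 and 2 can be illustrated with a small Python sketch. It assumes that the BLAST search has already been run and its hits are available as (percent identity, GO numbers) pairs; the helper names, the toy GO identifiers ("GO:A", "GO:B", ...), and the choice to skip GO numbers absent from the training set are all illustrative assumptions rather than details taken from the original work.

```python
from collections import defaultdict

L_CELL = 6  # number of subcellular locations in this study


def first_annotated_go_terms(homologs, min_identity=60.0):
    """Step 1: walk down the homologs (identity %, GO numbers) in order of
    decreasing identity and return the GO numbers of the first one that has any."""
    for identity, go_terms in sorted(homologs, key=lambda h: h[0], reverse=True):
        if identity > min_identity and go_terms:
            return list(go_terms)
    return []


def build_go_statistics(training_set):
    """Count N(k) and n(k, u) of Eq. 7 from (GO numbers, location indices) pairs."""
    N = defaultdict(int)
    n = defaultdict(lambda: [0] * L_CELL)
    for go_terms, locations in training_set:
        for go in set(go_terms):
            N[go] += 1
            for u in locations:  # a multi-location protein counts in each subset
                n[go][u] += 1
    return N, n


def go_feature_vector(query_go_terms, N, n):
    """Step 2 (Eq. 8): Psi_u = max over the query's GO numbers k of n(k, u) / N(k)."""
    psi = [0.0] * L_CELL
    for go in query_go_terms:
        if N.get(go, 0) == 0:
            continue  # assumption: GO numbers unseen in training contribute nothing
        for u in range(L_CELL):
            psi[u] = max(psi[u], n[go][u] / N[go])
    return psi


# Toy training set with hypothetical GO identifiers and 0-based location indices.
training = [
    (["GO:A", "GO:B"], [0]),
    (["GO:A"], [0, 4]),
    (["GO:B", "GO:C"], [3]),
    (["GO:C"], [3, 4]),
]
N, n = build_go_statistics(training)
hits = [(95.0, []), (80.0, ["GO:A", "GO:C"]), (61.0, ["GO:B"])]
query_go = first_annotated_go_terms(hits)
psi = go_feature_vector(query_go, N, n)
print(query_go, psi)
```

Whatever the number of GO numbers attached to a protein, the resulting vector always has six components, which is the dimension reduction the text describes.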

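2.3. Operation Algorithm

The 3rd step in the 5-step rule [26] concerns the operation algorithm (or engine) used to run the prediction. Here, we adopted the ML-GKR (multi-label Gaussian kernel regression) classifier, as described below. According to Eq.8 or Supporting Information S3, the i-th virus protein $\mathbf{P}^i$ in the benchmark dataset 𝕊 of Eq.3 can be formulated as

$$ \mathbf{P}_{\mathrm{GO}}^{i} = [\Psi_1^i \; \Psi_2^i \; \Psi_3^i \; \cdots \; \Psi_6^i]^{\mathbf{T}}, \quad (i = 1, 2, \cdots, N_{\mathrm{seq}}) \tag{9} $$

Now let us use the 6-D vector $\mathbf{L}^i$ to describe its subcellular location(s) in the multi-label system; i.e.,

$$ \mathbf{L}^i = [\ell_1^i \; \ell_2^i \; \ell_3^i \; \cdots \; \ell_6^i]^{\mathbf{T}} \tag{10} $$

where

$$ \ell_u^i = \begin{cases} +1, & \text{if } \mathbf{P}^i \in \mathbb{S}_u \\ -1, & \text{otherwise} \end{cases} \quad (u = 1, 2, \cdots, 6) \tag{11} $$

Likewise, for a query virus protein $\mathbf{P}^q$ we have

$$ \mathbf{P}_{\mathrm{GO}}^{q} = [\Psi_1^q \; \Psi_2^q \; \Psi_3^q \; \cdots \; \Psi_6^q]^{\mathbf{T}} \tag{12} $$

Its subcellular location label(s) in the multi-label system should be given by

$$ \mathbf{L}^q = [\ell_1^q \; \ell_2^q \; \ell_3^q \; \cdots \; \ell_6^q]^{\mathbf{T}} \tag{13} $$

where

$$ \ell_u^q = \begin{cases} +1, & \text{if } \Delta_u \ge 0 \\ -1, & \text{otherwise} \end{cases} \quad (u = 1, 2, \cdots, 6) \tag{14} $$

The $\Delta_u$ in Eq.14 is given by

$$ \Delta_u = \left[ \sum_{i=1}^{N_{\mathrm{train}}} \ell_u^i \cdot \exp\left( -\frac{\| \mathbf{P}_{\mathrm{GO}}^{q} - \mathbf{P}_{\mathrm{GO}}^{i} \|^2}{2\theta^2} \right) \right] \left[ \sum_{i=1}^{N_{\mathrm{train}}} \exp\left( -\frac{\| \mathbf{P}_{\mathrm{GO}}^{q} - \mathbf{P}_{\mathrm{GO}}^{i} \|^2}{2\theta^2} \right) \right]^{-1} \tag{15} $$

where $N_{\mathrm{train}}$ is the number of proteins used to train the model, θ is a parameter whose optimal value will be determined later, and $\| \mathbf{P}_{\mathrm{GO}}^{q} - \mathbf{P}_{\mathrm{GO}}^{i} \|^2$ is the squared Euclidean distance [71] between the query protein (Eq.12) and the i-th protein (Eq.9) in the benchmark dataset 𝕊; i.e.,

$$ \| \mathbf{P}_{\mathrm{GO}}^{q} - \mathbf{P}_{\mathrm{GO}}^{i} \|^2 = \sum_{u=1}^{6} \left( \Psi_u^q - \Psi_u^i \right)^2 \tag{16} $$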

Thus, the location label vector $\mathbf{L}^q$ of Eq.13 for the query virus protein $\mathbf{P}^q$ is well defined, and hence its subcellular location or locations can be explicitly predicted as well. For example, if $\ell_1^q = \ell_3^q = \ell_6^q = +1$ while all the other components in Eq.13 are equal to −1, the query virus protein $\mathbf{P}^q$ is located in the 1st, 3rd, and 6th subcellular locations (cf. Table 1); if $\ell_2^q = +1$ while all the others are equal to −1, the query virus protein is located in the 2nd subcellular location only; and so forth.

The predictor developed via the aforementioned procedures is called pLoc-mVirus, where “pLoc” stands for “predict subcellular localization”, and “mVirus” for “multi-label virus proteins”. Shown in Fig.1 is a flowchart illustrating how pLoc-mVirus works.
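The decision rule of Eqs. 13 to 16 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name, the toy training vectors, and the numerical-stability shift (subtracting the smallest squared distance before exponentiating, which cancels in the ratio of Eq. 15) are choices made here.

```python
import math


def ml_gkr_predict(query, train_vectors, train_labels, theta=0.25):
    """Predict the +1/-1 label vector of Eq. 13 for one query feature vector.

    query         : list of floats (the 6-D vector of Eq. 12)
    train_vectors : list of 6-D lists (Eq. 9)
    train_labels  : list of +1/-1 label lists (Eqs. 10 and 11)
    """
    # Squared Euclidean distances between the query and each training protein (Eq. 16).
    sq_dists = [sum((q - p) ** 2 for q, p in zip(query, vec)) for vec in train_vectors]
    # Shifting by the minimum distance leaves the ratio in Eq. 15 unchanged
    # but prevents every weight from underflowing to zero for small theta.
    d_min = min(sq_dists)
    weights = [math.exp(-(d - d_min) / (2.0 * theta ** 2)) for d in sq_dists]
    total = sum(weights)

    predicted = []
    for u in range(len(train_labels[0])):
        # Kernel-weighted average of the training labels for location u (Eq. 15).
        delta_u = sum(w * lab[u] for w, lab in zip(weights, train_labels)) / total
        predicted.append(1 if delta_u >= 0 else -1)  # sign rule of Eq. 14
    return predicted


# Toy data (hypothetical): three training proteins and their label vectors.
train_vectors = [
    [1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0, 0.0, 0.0],
]
train_labels = [
    [+1, -1, -1, -1, -1, -1],
    [-1, +1, -1, -1, -1, -1],
    [+1, -1, +1, -1, -1, -1],
]
single = ml_gkr_predict([0.9, 0.0, 0.1, 0.0, 0.0, 0.0], train_vectors, train_labels)
double = ml_gkr_predict([1.0, 0.0, 1.0, 0.0, 0.0, 0.0], train_vectors, train_labels)
print(single, double)
```

In this toy case the first query sits next to a single-location training protein and receives one positive label, while the second coincides with a two-location protein and receives two, showing how a single regression pass yields a multi-label answer.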

As mentioned in Chou’s 5-step rule [26], one of the important procedures in developing a new predictor is to objectively evaluate its anticipated accuracy. To address this, two issues need to be considered. (1) What metrics should be used to quantitatively reflect the predictor’s quality? (2) What test approach should be adopted to score the metrics?

2.4. A Set of Five Metrics for Multi-Label Systems

Unlike the metrics used to measure the prediction quality of single-label systems, the metrics for multi-label systems are much more complicated. To make them easier to understand for most experimental scientists, here we use the following intuitive Chou’s five metrics [17], which have recently been widely used for studying various multi-label systems (see, e.g., [38; 72-75]):

$$
\begin{cases}
\text{Aiming}\uparrow = \dfrac{1}{N^q} \displaystyle\sum_{k=1}^{N^q} \left( \dfrac{\| \mathbb{L}_k \cap \mathbb{L}_k^* \|}{\| \mathbb{L}_k^* \|} \right), & [0, 1] \\[2ex]
\text{Coverage}\uparrow = \dfrac{1}{N^q} \displaystyle\sum_{k=1}^{N^q} \left( \dfrac{\| \mathbb{L}_k \cap \mathbb{L}_k^* \|}{\| \mathbb{L}_k \|} \right), & [0, 1] \\[2ex]
\text{Accuracy}\uparrow = \dfrac{1}{N^q} \displaystyle\sum_{k=1}^{N^q} \left( \dfrac{\| \mathbb{L}_k \cap \mathbb{L}_k^* \|}{\| \mathbb{L}_k \cup \mathbb{L}_k^* \|} \right), & [0, 1] \\[2ex]
\text{Absolute true}\uparrow = \dfrac{1}{N^q} \displaystyle\sum_{k=1}^{N^q} \Delta(\mathbb{L}_k, \mathbb{L}_k^*), & [0, 1] \\[2ex]
\text{Absolute false}\downarrow = \dfrac{1}{N^q} \displaystyle\sum_{k=1}^{N^q} \left( \dfrac{\| \mathbb{L}_k \cup \mathbb{L}_k^* \| - \| \mathbb{L}_k \cap \mathbb{L}_k^* \|}{M} \right), & [1, 0]
\end{cases}
\tag{17}
$$

where $N^q$ is the total number of query proteins or tested proteins, M is the total number of different labels for the investigated system (for the current study it is $L_{\mathrm{cell}} = 6$), $\| \cdot \|$ is the operator that counts the number of elements in the set therein, $\cup$ is the symbol for “union” in set theory, $\cap$ is the symbol for “intersection”, $\mathbb{L}_k$ denotes the subset that contains all the labels observed by experiments for the k-th tested sample, $\mathbb{L}_k^*$ represents the subset that contains all the labels predicted for the k-th sample, and

$$ \Delta(\mathbb{L}_k, \mathbb{L}_k^*) = \begin{cases} 1, & \text{if all the labels in } \mathbb{L}_k^* \text{ are identical to those in } \mathbb{L}_k \\ 0, & \text{otherwise} \end{cases} \tag{18} $$

In Eq.17, the first four metrics, marked with an upward arrow ↑, are called positive metrics, meaning that the larger the rate, the better the prediction quality; the fifth metric, marked with a downward arrow ↓, is called a negative metric, implying just the opposite.
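The five metrics of Eq.17 translate directly into set operations. The Python sketch below is a minimal illustration; the function name and the convention that an empty prediction scores zero for “Aiming” are assumptions made here. The toy data reuse the example from the Introduction: a protein truly located at sites 1, 2, and 3, predicted in turn as {1, 2}, {1, 2, 3, 4}, and {1, 2, 3}.

```python
def multilabel_metrics(true_sets, pred_sets, n_labels):
    """Compute the five metrics of Eq. 17.

    true_sets[k] : set of experimentally observed labels for sample k (non-empty)
    pred_sets[k] : set of predicted labels for sample k
    n_labels     : M, the total number of different labels in the system
    """
    nq = len(true_sets)
    sums = {"aiming": 0.0, "coverage": 0.0, "accuracy": 0.0,
            "absolute_true": 0.0, "absolute_false": 0.0}
    for true, pred in zip(true_sets, pred_sets):
        inter = len(true & pred)
        union = len(true | pred)
        # Assumption: a sample with no predicted label contributes 0 to Aiming.
        sums["aiming"] += inter / len(pred) if pred else 0.0
        sums["coverage"] += inter / len(true)
        sums["accuracy"] += inter / union
        sums["absolute_true"] += 1.0 if true == pred else 0.0  # Eq. 18
        sums["absolute_false"] += (union - inter) / n_labels
    return {name: value / nq for name, value in sums.items()}


true_sets = [{1, 2, 3}, {1, 2, 3}, {1, 2, 3}]
pred_sets = [{1, 2}, {1, 2, 3, 4}, {1, 2, 3}]
scores = multilabel_metrics(true_sets, pred_sets, n_labels=6)
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

Only the third prediction counts toward the absolute true rate (1/3 here), even though the other two are mostly right, which is why the text calls it the harshest of the five.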

From Eq.17 we can see the following: (1) the “Aiming” defined by the 1st sub-equation checks the rate or percentage of the correctly predicted labels over the practically predicted labels; (2) the “Coverage” defined in the 2nd sub-equation checks the rate of the correctly predicted labels over the actual labels in the system concerned; (3) the “Accuracy” in the 3rd sub-equation checks the average ratio of correctly predicted labels over the total labels, including correctly and incorrectly predicted labels as well as real labels that are missed in the prediction; (4) the “Absolute true” in the 4th sub-equation checks the ratio of the perfectly or completely correct prediction events over the total prediction events; (5) the “Absolute false” in the 5th sub-equation checks the ratio of the completely wrong predictions over the total prediction events.

2.5. Jackknife Test

Three cross-validation methods are often used in statistical prediction: (1) the independent dataset test, (2) the subsampling (or K-fold cross-validation) test, and (3) the jackknife test [71]. Of these three, however, the jackknife test is deemed the least arbitrary, since it always yields a unique outcome for a given benchmark dataset, as elucidated in [26]. Accordingly, the jackknife test has been widely recognized and increasingly used by investigators to examine the quality of various predictors (see, e.g., [48; 49; 56; 76-81]), and it was also used in this study. During the jackknifing process, both the training dataset and the testing dataset are actually open, and each protein is in turn moved between the two. In other words, each tested protein is completely excluded from the dataset used to train the prediction model that is applied to it. Although the jackknife test takes much longer computational time, it is worthwhile since the result thus obtained is more objective by excluding any bias and arbitrariness.

3. RESULTS AND DISCUSSION

3.1. Parameter Determination

Since Eq.15 contains a parameter θ, the predicted results obtained by pLoc-mVirus depend on the parameter’s value. In this study, the optimal value for θ was determined by maximizing the absolute true rate (see the 4th sub-equation in Eq.17) via jackknife validation on the benchmark dataset. As shown in Fig.2, the absolute true rate reached its highest score when θ = 1/4. This value was used for further study.

3.2. Comparison with the State-of-the-Art Predictor

Listed in Table 2 are the rates obtained by the current pLoc-mVirus predictor via the jackknife test on the benchmark dataset (Supporting Information S1). To facilitate comparison, the table also lists the corresponding results obtained by iLoc-Virus [16], the most powerful existing predictor for identifying the subcellular localization of virus proteins with both single and multiple sites.
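For readers who wish to see how jackknife rates of this kind, and the θ scan of Section 3.1, can be produced, a minimal Python sketch follows. It is an illustration under simplifying assumptions, not the authors' pipeline: the compact predictor restates Eqs. 14 to 16, the toy vectors and candidate θ values are hypothetical, and the feature vectors are held fixed, whereas a strict jackknife would also rebuild the GO statistics of Eq. 8 without the held-out protein.

```python
import math


def gkr_predict(query, vectors, labels, theta):
    """Compact restatement of the Gaussian kernel regression rule (Eqs. 14-16)."""
    d2 = [sum((q - p) ** 2 for q, p in zip(query, v)) for v in vectors]
    d_min = min(d2)  # stability shift; it cancels in the ratio of Eq. 15
    w = [math.exp(-(d - d_min) / (2.0 * theta ** 2)) for d in d2]
    total = sum(w)
    return [1 if sum(wi * lab[u] for wi, lab in zip(w, labels)) / total >= 0 else -1
            for u in range(len(labels[0]))]


def jackknife_absolute_true(vectors, labels, theta):
    """Leave each protein out in turn, train on the rest, and count exact matches."""
    hits = 0
    for i in range(len(vectors)):
        train_v = vectors[:i] + vectors[i + 1:]
        train_l = labels[:i] + labels[i + 1:]
        if gkr_predict(vectors[i], train_v, train_l, theta) == labels[i]:
            hits += 1
    return hits / len(vectors)


def scan_theta(vectors, labels, candidates):
    """Return the candidate theta with the highest jackknife absolute true rate."""
    scores = {t: jackknife_absolute_true(vectors, labels, t) for t in candidates}
    best = max(scores, key=scores.get)
    return best, scores


# Toy data (hypothetical): two well-separated groups of three proteins each.
vectors = [
    [1.0, 0.0, 0.0, 0.0, 0.0, 0.0], [0.9, 0.1, 0.0, 0.0, 0.0, 0.0],
    [0.8, 0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.9, 0.1, 0.0, 0.0, 0.0], [0.1, 0.8, 0.0, 0.0, 0.0, 0.0],
]
labels = [[+1, -1, -1, -1, -1, -1]] * 3 + [[-1, +1, -1, -1, -1, -1]] * 3
candidates = [0.125, 0.25, 0.5, 1.0]
best_theta, scores = scan_theta(vectors, labels, candidates)
print(best_theta, scores)
```

Because every protein is held out exactly once, the outcome for a given dataset and θ is unique, which is the property that Section 2.5 cites in favor of the jackknife test.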

As shown in Table 2, among the five metrics in Eq.17 used to quantitatively measure the quality of a multi-label predictor [17], the rates for “Aiming”, “Accuracy”, and “Absolute false” were not reported for iLoc-Virus [16]; only the rates for “Coverage” and “Absolute true” were reported. Both of them are remarkably lower than the corresponding rates achieved by the pLoc-mVirus predictor proposed in this paper.

3.3. Web Server and User Guide

It was pointed out in [82] that publicly accessible web-servers represent the future direction for developing various computational predictors. Indeed, user-friendly web-servers such as those given in a series of recent publications [33-35; 43; 73; 74; 83-100] significantly enhance the impact of theoretical work because they can be easily used by a broad range of experimental scientists to get their desired results [44]. Accordingly, a web-server for the pLoc-mVirus predictor has been established. Moreover, to maximize users’ convenience, a step-by-step guide is given in Supporting Information S4.

4. CONCLUSION

Virus protein subcellular location prediction is a challenging problem, particularly when the query virus proteins have multi-label features, meaning that they may occur at two or more different location sites. Here, we have developed a new predictor called pLoc-mVirus by incorporating the optimal GO information into Chou’s general PseAAC [26]. Compared with iLoc-Virus [16], the most powerful existing predictor that also has the capacity to deal with the multiple locations of virus proteins, the scores achieved by the new predictor are remarkably better according to the metrics widely used to measure the quality of multi-label predictors.

Why could the new predictor be so powerful? The key is that the feature vectors used in the new predictor have been optimized via a special general PseAAC approach that substantially reduces their dimension while significantly improving their cluster features, as realized by Eq.8. Since publicly accessible web-servers represent the future direction for developing practically more useful prediction methods [82], the web-server for pLoc-mVirus has been established and its user guide is given in Supporting Information S4. It is anticipated that pLoc-mVirus will become a very useful high-throughput tool for annotating the subcellular locations of virus proteins.

ACKNOWLEDGMENTS

This work was supported by grants from the National Natural Science Foundation of China (No. 31560316, 61261027, 61262038, 61202313 and 31260273), the Province National Natural Science Foundation of Jiangxi (No. 20132BAB201053), the Jiangxi Provincial Foreign Scientific and Technological Cooperation Project (No. 20120BDH80023), and the Department of Education of Jiangxi Province (GJJ160866).

Table 1. Breakdown of the virus proteins in the benchmark dataset 𝕊 into the six subsets according to their different subcellular localizations (cf. Supporting Information S1 and Supporting Information S2).

Subset    Subcellular location name                                Number of proteins
𝕊1        Viral capsid                                             8
𝕊2        Host cell membrane                                       33
𝕊3        Host endoplasmic reticulum                               20
𝕊4        Host cytoplasm                                           87
𝕊5        Host nucleus                                             84
𝕊6        Secreted                                                 20
          Total number of virtual proteins Nvir (a)                252
          Total number of proteins with different sequences Nseq   207
          The multiplicity degree MD (b)                           1.217

(a) See Eq.1 (the numerator of Eq.2) and the relevant text for the definition of virtual proteins.
(b) See Eq.2 for the definition of multiplicity degree.

Table 2. Comparison with the state-of-the-art methods in predicting the subcellular localization of virus proteins (a)

Predictor          Aiming (↑) (b)   Coverage (↑) (b)   Accuracy (↑) (b)   Absolute true (↑) (b)   Absolute false (↓) (b)
pLoc-mVirus (c)    88.97%           92.86%             89.77%             82.13%                  2.66%
iLoc-Virus (d)     N/A              78.2%              N/A                74.80%                  N/A

(a) The rates listed below were derived by the jackknife test on the benchmark dataset 𝕊 (Supporting Information S1).
(b) See Eq.17 for the definition of the metrics.
(c) The predictor proposed in this paper when θ = 1/4 in Eq.15.
(d) The predictor proposed in [16].

FIGURE LEGENDS

Figure 1. A flowchart to show the process of how the pLoc-mVirus predictor works.

Figure 2. A plot to show the process of finding the optimal θ value in Eq.15. See the main text for further explanation.

Figure 1

Figure 2

REFERENCES

[1] M.A. Andrade, S.I. O'Donoghue, B. Rost, Adaptation of protein surfaces to subcellular location. J. Mol. Biol. 276 (1998) 517-525.
[2] A. Reinhardt, T. Hubbard, Using neural networks for prediction of the subcellular location of proteins. Nucleic Acids Research. 26 (1998) 2230-2236.
[3] K.C. Chou, Using discriminant function for prediction of subcellular location of prokaryotic proteins. Biochem Biophys Res Commun (BBRC). 252 (1998) 63-68.
[4] K. Nakai, P. Horton, PSORT: a program for detecting sorting signals in proteins and predicting their subcellular localization. Trends in Biochemical Science. 24 (1999) 34-36.
[5] D.W. Elrod, Protein subcellular location prediction. Protein Engineering. 12 (1999) 107-118.
[6] O. Emanuelsson, H. Nielsen, S. Brunak, G. von Heijne, Predicting subcellular localization of proteins based on their N-terminal amino acid sequence. Journal of Molecular Biology. 300 (2000) 1005-1016.
[7] G.P. Zhou, K. Doctor, Subcellular location prediction of apoptosis proteins. PROTEINS: Structure, Function, and Genetics. 50 (2003) 44-48.
[8] S. Matsuda, J.P. Vert, H. Saigo, N. Ueda, H. Toh, T. Akutsu, A novel representation of protein sequences for prediction of subcellular location using support vector machines. Protein Sci. 14 (2005) 2804-13.
[9] J.L. Gardy, M.R. Laird, F. Chen, S. Rey, C.J. Walsh, M. Ester, F.S. Brinkman, PSORTb v.2.0: expanded prediction of bacterial protein subcellular localization and insights gained from comparative proteome analysis. Bioinformatics. 21 (2005) 617-23.
[10] A. Hoglund, P. Donnes, T. Blum, H.W. Adolph, O. Kohlbacher, MultiLoc: prediction of protein subcellular localization using N-terminal targeting sequences, sequence motifs and amino acid composition. Bioinformatics. 22 (2006) 1158-65.
[11] P. Mundra, M. Kumar, K.K. Kumar, V.K. Jayaraman, B.D. Kulkarni, Using pseudo amino acid composition to predict protein subnuclear localization: Approached with PSSM. Pattern Recognition Letters. 28 (2007) 1610-1615.
[12] P. Horton, K.J. Park, T. Obayashi, N. Fujita, H. Harada, C.J. Adams-Collier, K. Nakai, WoLF PSORT: protein localization predictor. Nucleic Acids Res. 35 (2007) W585-7.
[13] K.C. Chou, H.B. Shen, Review: Recent progresses in protein subcellular location prediction. Analytical Biochemistry. 370 (2007) 1-16.
[14] E. Glory, R.F. Murphy, Automated subcellular location determination and high-throughput microscopy. Dev Cell. 12 (2007) 7-16.
[15] H.B. Shen, Virus-mPLoc: A Fusion Classifier for Viral Protein Subcellular Location Prediction by Incorporating Multiple Sites. J Biomol Struct Dyn (JBSD). 28 (2010) 175-86.
[16] X. Xiao, Z.C. Wu, iLoc-Virus: A multi-label learning classifier for identifying the subcellular localization of virus proteins with both single and multiple sites. Journal of Theoretical Biology. 284 (2011) 42-51.

[17] K.C. Chou, Some Remarks on Predicting Multi-Label Attributes in Molecular Biosystems. Molecular Biosystems. 9 (2013) 1092-1100.
[18] G. Aguero-Chapin, A. Antunes, F.M. Ubeira, H. Gonzalez-Diaz, Comparative Study of Topological Indices of Macro/Supra-molecular RNA Complex Networks. Journal of Chemical Information & Modeling. 48 (2008) 2265-2277.
[19] F.J. Prado-Prado, H. Gonzalez-Diaz, O.M. de la Vega, F.M. Ubeira, Unified QSAR approach to antimicrobials. Part 3: First multi-tasking QSAR model for Input-Coded prediction, structural back-projection, and complex networks clustering of antiprotozoal compounds. Bioorganic & Medicinal Chemistry. 16 (2008) 5871-5880.
[20] C.R. Munteanu, A.L. Magalhaes, E. Uriarte, H. Gonzalez-Diaz, Multi-target QPDR classification model for human breast and colon cancer-related proteins using star graph topological indices. Journal of Theoretical Biology. 257 (2009) 303-311.
[21] S. Vilar, H. Gonzalez-Diaz, L. Santana, E. Uriarte, A network-QSAR model for prediction of genetic-component biomarkers in human colorectal cancer. Journal of Theoretical Biology. 261 (2009) 449-458.
[22] H. Gonzalez-Diaz, Network topological indices, drug metabolism, and distribution. Curr Drug Metab. 11 (2010) 283-4.
[23] Y. Rodriguez-Soca, C.R. Munteanu, J. Dorado, A. Pazos, F.J. Prado-Prado, H. Gonzalez-Diaz, Trypano-PPI: a web server for prediction of unique targets in trypanosome proteome by using electrostatic parameters of protein-protein interactions. J Proteome Res. 9 (2010) 1182-90.
[24] H. Gonzalez-Diaz, F. Prado-Prado, E. Sobarzo-Sanchez, M. Haddad, S. Maurel Chevalley, A. Valentin, J. Quetin-Leclercq, M.A. Dea-Ayuela, M. Teresa Gomez-Munos, C.R. Munteanu, J. Jose Torres-Labandeira, X. Garcia-Mera, R.A. Tapia, F.M. Ubeira, NL MIND-BEST: A web server for ligands and proteins discovery - Theoretic-experimental study of proteins of Giardia lamblia and new compounds active against Plasmodium falciparum. Journal of Theoretical Biology. 276 (2011) 229-49.
[25] C.R. Munteanu, H. Gonzalez-Diaz, F. Borges, A.L. de Magalhaes, Natural/random protein classification models based on star network topological indices. Journal of Theoretical Biology. 254 (2008) 775-783.
[26] K.C. Chou, Some remarks on protein attribute prediction and pseudo amino acid composition (50th Anniversary Year Review). Journal of Theoretical Biology. 273 (2011) 236-247.
[27] Y. Xu, J. Ding, L.Y. Wu, iSNO-PseAAC: Predict cysteine S-nitrosylation sites in proteins by incorporating position specific amino acid propensity into pseudo amino acid composition. PLoS ONE. 8 (2013) e55844.
[28] W. Chen, P.M. Feng, H. Lin, iRSpot-PseDNC: identify recombination spots with pseudo dinucleotide composition. Nucleic Acids Research. 41 (2013) e68.
[29] Q. Su, W. Lu, D. Du, F. Chen, B. Niu, Prediction of the aquatic toxicity of aromatic compounds to tetrahymena pyriformis through support vector regression. Oncotarget. doi:10.18632/oncotarget.17210 (2017).

ACCEPTED MANUSCRIPT 19

AC

CE

PT E

D

MA

NU

SC

RI

PT

[30] Y. Xu, X. Wen, X.J. Shao, N.Y. Deng, iHyd-PseAAC: Predicting hydroxyproline and hydroxylysine in proteins by incorporating dipeptide position-specific propensity into pseudo amino acid composition. International Journal of Molecular Sciences (IJMS). 15 (2014) 7594-7610. [31] H. Lin, E.Z. Deng, H. Ding, W. Chen, K.C. Chou, iPro54-PseKNC: a sequencebased predictor for identifying sigma-54 promoters in prokaryote with pseudo ktuple nucleotide composition. Nucleic Acids Research. 42 (2014) 12961-12972. [32] W. Chen, P. Feng, H. Ding, H. Lin, Using deformation energy to analyze nucleosome positioning in genomes. Genomics. 107 (2016) 69-75. [33] W. Chen, H. Tang, J. Ye, iRNA-PseU: Identifying RNA pseudouridine sites Molecular Therapy - Nucleic Acids 5 (2016) e332. [34] P. Feng, H. Ding, H. Yang, W. Chen, iRNA-PseColl: Identifying the occurrence sites of different RNA modifications by incorporating collective effects of nucleotides into PseKNC Molecular Therapy - Nucleic Acids 7(2017) 155-163. [35] B. Liu, F. Yang, K.C. Chou, 2L-piRNA: A two-layer ensemble classifier for identifying piwi-interacting RNAs and their function. Molecular Therapy - Nucleic Acids. 7 (2017) 267-277. [36] H.B. Shen, A top-down approach to enhance the power of predicting human protein subcellular localization: Hum-mPLoc 2.0. Analytical Biochemistry. 394 (2009) 269-274. [37] Z.C. Wu, X. Xiao, iLoc-Hum: Using accumulation-label scale to predict subcellular locations of human proteins with both single and multiple sites. Molecular Biosystems. 8 (2012) 629-641. [38] W.Z. Lin, J.A. Fang, X. Xiao, iLoc-Animal: A multi-label learning classifier for predicting subcellular localization of animal proteins Molecular BioSystems. 9 (2013) 634-644. [39] H.B. Shen, A new method for predicting the subcellular localization of eukaryotic proteins with both single and multiple sites: Euk-mPLoc 2.0 PLoS ONE. 5 (2010) e9931. [40] Z.C. Wu, X. 
Xiao, iLoc-Euk: A Multi-Label Classifier for Predicting the Subcellular Localization of Singleplex and Multiplex Eukaryotic Proteins. PLoS One. 6 (2011) e18258. [41] J. Chen, R. Long, X.L. Wang, B. Liu, dRHP-PseRA: detecting remote homology proteins using profile-based pseudo protein sequence and rank aggregation. Scientific Reports (2016) 6:32333. [42] X. Xiao, P. Wang, W.Z. Lin, J.H. Jia, iAMP-2L: A two-level multi-label classifier for identifying antimicrobial peptides and their functional types. Analytical Biochemistry. 436 (2013) 168-177. [43] J. Jia, Z. Liu, X. Xiao, B. Liu, pSuc-Lys: Predict lysine succinylation sites in proteins with PseAAC and ensemble random forest approach. Journal of Theoretical Biology. 394 (2016) 223-230. [44] K.C. Chou, Impacts of bioinformatics to medicinal chemistry. Medicinal Chemistry. 11 (2015) 218-234. [45] K.C. Chou, Prediction of protein cellular attributes using pseudo amino acid

ACCEPTED MANUSCRIPT 20

AC

CE

PT E

D

MA

NU

SC

RI

PT

composition. PROTEINS: Structure, Function, and Genetics (Erratum: ibid., 2001, Vol.44, 60). 43 (2001) 246-255. [46] X.B. Zhou, C. Chen, Z.C. Li, X.Y. Zou, Using Chou's amphiphilic pseudo amino acid composition and support vector machine for prediction of enzyme subfamily classes. Journal of Theoretical Biology. 248 (2007) 546–551. [47] H. Lin, The modified Mahalanobis discriminant for predicting outer membrane proteins by using Chou's pseudo amino acid composition. Journal of Theoretical Biology. 252 (2008) 350-356. [48] M. Esmaeili, H. Mohabatkar, S. Mohsenzadeh, Using the concept of Chou's pseudo amino acid composition for risk type prediction of human papillomaviruses. Journal of Theoretical Biology. 263 (2010) 203-209. [49] H. Mohabatkar, M. Mohammad Beigi, A. Esmaeili, Prediction of GABA(A) receptor proteins using the concept of Chou's pseudo amino acid composition and support vector machine. Journal of Theoretical Biology. 281 (2011) 18-23. [50] L. Nanni, A. Lumini, D. Gupta, A. Garg, Identifying bacterial virulent proteins by fusing a set of classifiers based on variants of Chou's pseudo amino acid composition and on evolutionary information. IEEE-ACM Transaction on Computational Biolology and Bioinformatics. 9 (2012) 467-475. [51] E. Pacharawongsakda, T. Theeramunkong, Predict Subcellular Locations of Singleplex and Multiplex Proteins by Semi-Supervised Learning and DimensionReducing General Mode of Chou's PseAAC. IEEE Transactions on Nanobioscience. 12 (2013) 311-320. [52] S. Mondal, P.P. Pai, Chou's pseudo amino acid composition improves sequence-based antifreeze protein prediction. J Theor Biol. 356 (2014) 30-5. [53] S. Ahmad, M. Kabir, M. Hayat, Identification of Heat Shock Protein families and J-protein types by incorporating Dipeptide Composition into Chou's general PseAAC. Comput Methods Programs Biomed. 122 (2015) 165-74. [54] L.M. Liu, Y. 
Xu, iPGK-PseAAC: identify lysine phosphoglycerylation sites in proteins by incorporating four different tiers of amino acid pairwise coupling information into the general PseAAC Med Chem. 13, doi:10.2174/1573406413666170515120507 (2017). [55] M. Rahimi, M.R. Bakhtiarizadeh, A. Mohammadi-Sangcheshmeh, OOgenesis_Pred: A sequence-based method for predicting oogenesis proteins by six different modes of Chou's pseudo amino acid composition. J Theor Biol. 414 (2017) 128-136. [56] P.K. Meher, T.K. Sahu, V. Saini, A.R. Rao, Predicting antimicrobial peptides with improved accuracy by incorporating the compositional, physico-chemical and structural features into Chou's general PseAAC. Sci Rep. 7 (2017) 42362. [57] Y. Xu, C. Li, iPreny-PseAAC: identify C-terminal cysteine prenylation sites in proteins by incorporating two tiers of sequence couplings into PseAAC Med Chem. doi:10.2174/1573406413666170419150052 (2017). [58] K.C. Chou, Pseudo amino acid composition and its applications in bioinformatics, proteomics and system biology. Current Proteomics. 6 (2009) 262-274.

ACCEPTED MANUSCRIPT 21

AC

CE

PT E

D

MA

NU

SC

RI

PT

[59] W. Chen, H. Lin, Pseudo nucleotide composition or PseKNC: an effective formulation for analyzing genomic sequences. Mol BioSyst. 11 (2015) 26202634. [60] K.C. Chou, An unprecedented revolution in medicinal chemistry driven by the progress of biological science. Current Topics in Medicinal Chemistry 17, doi: 10.2174/1568026617666170414145508 (2017). [61] B. Liu, F. Liu, X. Wang, J. Chen, L. Fang, Pse-in-One: a web server for generating various modes of pseudo components of DNA, RNA, and protein sequences . Nucleic Acids Research. 43 (2015) W65-W71. [62] B. Liu, H. Wu, Pse-in-One 2.0: An improved package of web servers for generating various modes of pseudo components of DNA, RNA, and protein Sequences. Natural Science. 9 (2017) 67-91. [63] Z.C. Wu, X. Xiao, iLoc-Plant: a multi-label classifier for predicting the subcellular localization of plant proteins with both single and multiple sites. Molecular BioSystems. 7 (2011) 3287-3297. [64] X. Xiao, Z.C. Wu, A multi-label classifier for predicting the subcellular localization of gram-negative bacterial proteins with both single and multiple sites. PLoS ONE. 6 (2011) e20592. [65] S. Wan, M.W. Mak, S.Y. Kung, GOASVM: A subcellular location predictor by incorporating term-frequency gene ontology into the general form of Chou's pseudo amino acid composition. Journal of Theoretical Biology. 323 (2013) 4048. [66] K.C. Chou, H.B. Shen, Cell-PLoc: A package of Web servers for predicting subcellular localization of proteins in various organisms. Nature Protocols. 3 (2008) 153-162. [67] H.B. Shen, Hum-mPLoc: An ensemble classifier for large-scale human protein subcellular location prediction by incorporating samples with multiple sites. Biochem Biophys Res Commun (BBRC). 355 (2007) 1006-1011. [68] Y.D. Cai, A new hybrid approach to predict subcellular localization of proteins by incorporating gene ontology. Biochemical and Biophysical Research Communications (BBRC). 311 (2003) 743-747. [69] K.C. Chou, H.B. 
Shen, Predicting eukaryotic protein subcellular location by fusing optimized evidence-theoretic K-nearest neighbor classifiers. Journal of Proteome Research. 5 (2006) 1888-1897. [70] T. Wang, J. Yang, H.B. Shen, Predicting membrane protein types by the LLDA algorithm. Protein & Peptide Letters. 15 (2008) 915-921. [71] C.T. Zhang, Review: Prediction of protein structural classes. Critical Reviews in Biochemistry and Molecular Biology. 30 (1995) 275-349. [72] W.R. Qiu, B.Q. Sun, Z.C. Xu, iPTM-mLys: identifying multiple lysine PTM sites and their different types. Bioinformatics. 32 (2016) 3116-3123. [73] X. Cheng, S.G. Zhao, X. Xiao, iATC-mISF: a multi-label classifier for predicting the classes of anatomical therapeutic chemicals. Bioinformatics. 33 (2017) 341346. [74] X. Cheng, S.G. Zhao, X. Xiao, iATC-mHyb: a hybrid multi-label classifier for

ACCEPTED MANUSCRIPT 22

AC

CE

PT E

D

MA

NU

SC

RI

PT

predicting the classification of anatomical therapeutic chemicals Oncotarget. doi: 10.18632/oncotarget.17028 (2017). [75] X. Cheng, X. Xiao, pLoc-mPlant: predict subcellular localization of multilocation plant proteins via incorporating the optimal GO information into general PseAAC. Molecular BioSystems. doi:10.1039/c7mb00267j (2017). [76] G.P. Zhou, N. Assa-Munt, Some insights into protein structural class prediction. PROTEINS: Structure, Function, and Genetics. 44 (2001) 57-59. [77] D.W. Elrod, Prediction of enzyme family classes. Journal of Proteome Research. 2 (2003) 183-190. [78] K.C. Chou, H.B. Shen, MemType-2L: A Web server for predicting membrane proteins and their types by incorporating evolution information through PsePSSM. Biochem Biophys Res Comm (BBRC). 360 (2007) 339-345. [79] F. Ali, M. Hayat, Classification of membrane protein types using Voting Feature Interval in combination with Chou's Pseudo Amino Acid Composition. J Theor Biol. 384 (2015) 78-83. [80] M. Tahir, M. Hayat, iNuc-STNC: a sequence-based predictor for identification of nucleosome positioning in genomes by extending the concept of SAAC and Chou's PseAAC. Mol Biosyst 12 (2016) 2587-93. [81] M. Khan, M. Hayat, S.A. Khan, N. Iqbal, Unb-DPC: Identify mycobacterial membrane protein types by incorporating un-biased dipeptide composition into Chou's general PseAAC. J Theor Biol. 415 (2017) 13-19. [82] H.B. Shen, Recent advances in developing web-servers for predicting protein attributes. Natural Science. 1 (2009) 63-92 [83] J. Jia, Z. Liu, X. Xiao, B. Liu, iCar-PseCp: identify carbonylation sites in proteins by Monto Carlo sampling and incorporating sequence coupled effects into general PseAAC. Oncotarget. 7 (2016) 34558-34570. [84] Y. Xu, X.J. Shao, L.Y. Wu, iSNO-AAPair: incorporating amino acid pairwise coupling into PseAAC for predicting cysteine S-nitrosylation sites in proteins. PeerJ. 1 (2013) e171. [85] W.R. Qiu, B.Q. Sun, X. Xiao, Z.C. 
Xu, iHyd-PseCp: Identify hydroxyproline and hydroxylysine in proteins by incorporating sequence-coupled effects into general PseAAC. Oncotarget. 7 (2016) 44310-44321. [86] J. Jia, Z. Liu, X. Xiao, iPPI-Esml: an ensemble classifier for identifying the interactions of proteins by incorporating their physicochemical properties and wavelet transforms into PseAAC. J Theor Biol. 377 (2015) 47-56. [87] W.R. Qiu, X. Xiao, Z.H. Xu, iPhos-PseEn: identifying phosphorylation sites in proteins by fusing different pseudo components into an ensemble classifier. Oncotarget. 7 (2016) 51270-51283. [88] J. Wang, B. Yang, J. Revote, A. Leier, T.T. Marquez-Lago, G. Webb, J. Song, T. Lithgow, POSSUM: a bioinformatics toolkit for generating numerical sequence feature descriptors based on PSSM profiles Bioinformatics. doi:10.1093/bioinformatics/btx302 (2017). [89] W.R. Qiu, B.Q. Sun, D. Xu, iPhos-PseEvo: Identifying human phosphorylated proteins by incorporating evolutionary information into general PseAAC via grey

ACCEPTED MANUSCRIPT 23

AC

CE

PT E

D

MA

NU

SC

RI

PT

system theory. Molecular Informatics. 36 (2017). [90] B. Liu, L. Fang, R. Long, X. Lan, iEnhancer-2L: a two-layer predictor for identifying enhancers and their strength by pseudo k-tuple nucleotide composition. Bioinformatics. 32 (2016) 362-369. [91] W. Chen, H. Ding, P. Feng, H. Lin, iACP: a sequence-based tool for identifying anticancer peptides. Oncotarget. 7 (2016) 16895-16909. [92] X. Xiao, H.X. Ye, Z. Liu, J.H. Jia, iROS-gPseKNC: predicting replication origin sites in DNA by incorporating dinucleotide position-specific propensity into general pseudo nucleotide composition. Oncotarget. 7 (2016) 34180-34189. [93] C.J. Zhang, H. Tang, W.C. Li, H. Lin, iOri-Human: identify human origin of replication by incorporating dinucleotide physicochemical properties into pseudo nucleotide composition. Oncotarget. 7 (2016) 69783-69793. [94] B. Liu, S. Wang, R. Long, iRSpot-EL: identify recombination spots with an ensemble learning approach. Bioinformatics. 33 (2017) 35-41. [95] W. Chen, P. Feng, H. Yang, H. Ding, iRNA-AI: identifying the adenosine to inosine editing sites in RNA sequences. Oncotarget. 8 (2017) 4208-4217. [96] B. Liu, H. Wu, D. Zhang, X. Wang, Pse-Analysis: a python package for DNA/RNA and protein/peptide sequence analysis based on pseudo components and kernel methods. Oncotarget. 8 (2017) 13338-13343. [97] W.R. Qiu, S.Y. Jiang, Z.C. Xu, iRNAm5C-PseDNC: identifying RNA 5methylcytosine sites by incorporating physical-chemical properties into pseudo dinucleotide composition Oncotarget 8(2017) 41178-41188. [98] Y. Xu, X. Wen, L.S. Wen, L.Y. Wu, iNitro-Tyr: Prediction of nitrotyrosine sites in proteins with general pseudo amino acid composition. PLoS ONE. 9 (2014) e105018. [99] Z. Liu, X. Xiao, D.J. Yu, J. Jia, W.R. Qiu, pRNAm-PC: Predicting Nmethyladenosine sites in RNA sequences via physical-chemical properties. Anal Biochem. 497 (2016) 60-67. [100] W.R. Qiu, S.Y. Jiang, B.Q. Sun, X. 
Xiao, iRNA-2methyl: identify RNA 2′-Omethylation sites by incorporating sequence-coupled effects into general PseKNC and ensemble classifier. Med Chem. doi: 10.2174/1573406413666170623082245 (2017).

ABBREVIATION LIST

GO: Gene Ontology
PseAAC: Pseudo Amino Acid Composition
ML-GKR: Multi-Label Gaussian Kernel Regression


Graphical abstract

A flowchart showing how the pLoc-mVirus predictor works.

HIGHLIGHTS

We propose a new predictor for the subcellular localization of virus proteins with either single or multiple location sites.

It was developed via deep learning from the GO (Gene Ontology) database.

Its success rates are significantly higher than those of the state-of-the-art method in this area.

A web server has been established through which users can easily obtain their desired results.
