Prediction of nuclear receptors with optimal pseudo amino acid composition


Analytical Biochemistry 387 (2009) 54–59

Contents lists available at ScienceDirect

Analytical Biochemistry journal homepage: www.elsevier.com/locate/yabio

Qing-Bin Gao, Zhi-Chao Jin, Xiao-Fei Ye, Cheng Wu, Jia He *

Department of Health Statistics, Second Military Medical University, Shanghai 200433, China

Article info

Article history:
Received 6 September 2008
Available online 19 January 2009

Keywords:
Nuclear receptors
Subfamily prediction
Physicochemical character
Pseudo amino acid composition
Support vector machines

Abstract

Nuclear receptors are involved in multiple cellular signaling pathways that affect and regulate processes such as organ development and maintenance, ion transport, homeostasis, and apoptosis. In this article, an optimal pseudo amino acid composition based on physicochemical characters of amino acids is suggested to represent proteins for predicting the subfamilies of nuclear receptors. Six physicochemical characters of amino acids were adopted to generate the protein sequence features via the web server PseAAC. The optimal values of the rank of correlation factor and the weighting factor for PseAAC were determined to obtain the protein descriptor that leads to the best performance. A nonredundant dataset of nuclear receptors in four subfamilies was constructed to evaluate the method using support vector machines. An overall accuracy of 99.6% was achieved in the fivefold cross-validation test as well as the jackknife test, and an overall accuracy of 98.4% was reached in a blind dataset test. The performance is very competitive with that of some previous methods.

© 2009 Elsevier Inc. All rights reserved.

* Corresponding author. Fax: +86 21 81870418. E-mail address: [email protected] (J. He).
Abbreviations used: AA, amino acid; NLS, nuclear localization signal; SVM, support vector machine; PseAA, pseudo amino acid; OSH, optimal separating hyperplane; RBF, radial basis function.
0003-2697/$ - see front matter © 2009 Elsevier Inc. All rights reserved. doi:10.1016/j.ab.2009.01.018

Nuclear receptors form a superfamily of ligand-activated transcription factors that play a crucial role in the regulation of gene expression and, thus, are potential drug targets for the therapy of diseases such as breast cancer, diabetes, inflammatory diseases, and osteoporosis [1]. They act by binding small molecules that can easily be modified by drug design and controlling functions associated with these major diseases [2]. The study of nuclear receptor structure and function is essential for a proper understanding of normal and abnormal cellular mechanisms because of both their physiology and their pathophysiology [3]. As a rising branch, the recognition of subfamilies or types of novel nuclear receptors is crucial for developing therapeutic strategies for the diseases mentioned above because the function of a nuclear receptor is closely correlated with its category.

Nuclear receptors share a common structural organization. They consist of multiple domains or regions, including (i) the N-terminal region (A/B domain), which is variable in length from less than 50 to more than 500 amino acids (AAs); (ii) the conserved DNA-binding domain (C domain), which notably contains the P-box, a short motif responsible for DNA-binding specificity on sequences typically containing the AGGTCA motif, and is involved in dimerization of nuclear receptors; (iii) the largest and moderately conserved ligand-binding domain (E domain), whose secondary structure of 12 helices is better conserved than the primary structure (the E domain is responsible for many functions, mostly ligand induced); (iv) the less conserved region (D domain), which behaves as a flexible hinge between the C and E domains and contains the nuclear localization signal (NLS) that may overlap with the C domain; and (v) the F domain at the C terminus of the E domain, whose sequence is extremely variable and whose structure and function are unknown (nuclear receptors may or may not contain an F domain) [2].

The defining members of the nuclear receptor superfamily were first identified biochemically as receptors for steroid and thyroid hormones [4]. One alternative method for the identification of additional members of the nuclear receptor superfamily is to use sequence similarities to known nuclear receptors, based primarily on the conserved DNA-binding domain motif [5]; this may be performed with sequence similarity searching tools such as BLAST [6] and FASTA [7]. However, these methods are not always successful when the query protein sequences have no significant sequence similarity to the database sequences. As for nuclear receptors, the major limitation of these searching tools is that they are not able to recognize the subfamilies of nuclear receptors [8]. Therefore, a reliable and fast computational method for classifying the nuclear receptors at the subfamily level is strongly desired.

Bhasin and Raghava [8] investigated the problem using support vector machines (SVMs) as the predictor. In their fivefold cross-validation test, a nonredundant dataset consisting of 282 nuclear receptors was classified into four subfamilies with overall accuracies of 82.6 and 97.5% based on AA composition and dipeptide composition, respectively. The results demonstrate that there is a


Prediction of nuclear receptors / Q.-B. Gao et al. / Anal. Biochem. 387 (2009) 54–59

direct correlation between the protein sequences and the subfamilies of nuclear receptors. Although high accuracy has been observed, the problem is worthy of further investigation because AA composition and dipeptide composition might lose some sequence order information. The pseudo amino acid (PseAA) composition was originally introduced to incorporate such sequence order effects of proteins [9]. Currently, various types of PseAA composition have been proposed to improve the prediction quality of protein attributes, including protein subcellular localization [10–19], membrane protein type [20–26], protein structural class [27–32], protein subnuclear localization [33,34], protein submitochondria localization [35], conotoxin superfamily [36,37], protein oligomer type [38,39], enzyme family class [40–44], protease type [45,46], outer membrane protein [47], and cofactors of oxidoreductases [48].

In this article, an alternative method based on two types of PseAA composition and SVMs is introduced to predict the subfamilies of nuclear receptors. SVMs have proven to be a powerful tool in multiple areas of biological analysis. PseAA composition can be computed easily via the web server PseAAC [49]. More importantly, we examine the influence of the rank of correlation factor and the weighting factor (two parameters for computing PseAA composition) on the prediction performance of the current method. Different types of PseAA composition were generated, and the one with the highest prediction accuracy was singled out as the descriptor for protein sequences. An overall accuracy of 99.6% was reached in a fivefold cross-validation test, and an overall accuracy of 98.4% was achieved in a blind dataset test. The results demonstrate the efficiency of the current method. It can be anticipated that this method would play a complementary role for the high-quality prediction of nuclear receptors and other protein attributes.
Materials and methods

Dataset

The dataset of nuclear receptors used in this article was extracted from the NucleaRDB information system (April 2005, release 5.0) available at http://www.receptors.org/NR [1]. This database contains eight known nuclear receptor subfamilies: thyroid hormone-like, HNF4-like, estrogen-like, nerve growth factor IB-like, Fushi tarazu-F1-like, germ cell nuclear factor-like, knirps-like, and DAX-like. All putative orphan receptors and receptor fragments were excluded from the dataset. The remaining 722 proteins constitute the original dataset.

To reduce bias, a redundancy reduction procedure was performed on this dataset. Sequences with a high degree of similarity to the other sequences in the dataset were removed by the program CD-HIT [50–52], which clusters a protein sequence database at a high sequence identity threshold and thereby removes highly redundant sequences efficiently. We grouped all protein sequences by CD-HIT with a cluster identity threshold of 0.9 to ensure that no sequence had ≥ 90% sequence similarity to any other sequence in the dataset. After this screening procedure, subfamilies with too few nuclear receptors were also excluded from the dataset for the sake of statistical significance. These subfamilies are nerve growth factor IB-like (11 proteins), germ cell nuclear factor-like (3 proteins), knirps-like (7 proteins), and DAX-like (9 proteins).

The final dataset contains 345 proteins belonging to four subfamilies and is available at http://stat.smmu.edu.cn/bioinfo. This dataset was separated into two subsets, a training dataset and a blind dataset, for evaluating the performance of the proposed method. The blind dataset is used as a platform for comparing the performance of the proposed method with a previous method. The number of proteins and their distribution in each subfamily of the datasets are shown in Table 1.
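The size-based exclusion step can be illustrated with a minimal Python sketch. The subfamily counts are those given in the text and in Table 1. The cutoff `MIN_SUBFAMILY_SIZE = 20` is a hypothetical value chosen only so that the example reproduces the four retained subfamilies, because the article does not state the threshold that was used.

```python
# Subfamily sizes after redundancy reduction, as reported in the text and Table 1.
SUBFAMILY_SIZES = {
    "Thyroid hormone-like": 131,
    "HNF4-like": 87,
    "Estrogen-like": 104,
    "Fushi tarazu-F1-like": 23,
    "Nerve growth factor IB-like": 11,
    "Germ cell nuclear factor-like": 3,
    "Knirps-like": 7,
    "DAX-like": 9,
}

# Hypothetical cutoff: the article only says that subfamilies with
# "too few" members were excluded, without giving a number.
MIN_SUBFAMILY_SIZE = 20


def keep_large_subfamilies(sizes, min_size):
    """Return only the subfamilies that have at least min_size proteins."""
    return {name: n for name, n in sizes.items() if n >= min_size}


kept = keep_large_subfamilies(SUBFAMILY_SIZES, MIN_SUBFAMILY_SIZE)
print(sorted(kept))
print(sum(kept.values()))  # 345, the size of the final dataset
```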

Table 1
Number of protein sequences in each subfamily.

Subfamily          Final dataset   Training dataset   Blind dataset
Thyroid hormone    131             114                17
HNF4               87              72                 15
Estrogen           104             75                 29
Fushi tarazu-F1    23              21                 2
Total              345             282                63

PseAA composition

The newly published web server PseAAC [49] supplies us with a flexible tool for generating various types of PseAA composition; it can be accessed at http://chou.med.harvard.edu/bioinf/PseAA. Three types of parameters are used to generate the various kinds of PseAA composition: the quantitative characters of AAs, the rank of correlation, and the weighting factor. Currently, six physicochemical characters of AAs are supported for calculating the correlations between AAs at different positions along the protein sequence: hydrophobicity, hydrophilicity, side chain mass, pK of the α-COOH group, pK of the α-NH3+ group, and pI at 25 °C. Thus, 63 different parallel correlation types (type I) of PseAA composition (one for each nonempty subset of the six characters, 2^6 – 1 = 63) and 63 different series correlation types (type II) of PseAA composition, as well as the dipeptide PseAA composition, can be generated by PseAAC. The dimension of the output of type I PseAA composition is 20 + λ [9], the dimension of the output of type II PseAA composition is 20 + (n × λ) [42], and the dimension of the dipeptide PseAA composition is 420. Here λ is a nonnegative integer smaller than the length of the input sequence that represents the rank of correlation of AAs along a protein sequence, and n is the number of AA characters selected by the user (n = 6 in this study). In particular, when λ = 0, the PseAA composition degenerates to the conventional AA composition. The weighting factor allows the user to put weight on the additional PseAA components with respect to the conventional AA components.

The hydrophobicity and hydrophilicity of the amino acids in a protein are closely correlated with its structure and functions, including its folding, its interaction with the environment and other molecules, and its catalytic mechanism. Some studies have shown that these two indexes can be used to reflect, at least partially, the sequence order effects through the amphiphilic PseAA composition [42,44].
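To make the parallel correlation (type I) descriptor more concrete, the following Python sketch computes a vector whose dimension is 20 plus the rank of correlation. It follows the general form of the definition in [9] as we understand it: sequence order correlation factors are averaged over the selected, standardized characters and then weighted against the 20 composition terms. It is not the code of the PseAAC server. The function names and `TOY_SCALE` are hypothetical, and `TOY_SCALE` holds placeholder numbers rather than real physicochemical values.

```python
from statistics import mean, pstdev

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Placeholder scale for illustration only; NOT a real hydrophobicity table.
TOY_SCALE = {aa: float(i) for i, aa in enumerate(AMINO_ACIDS)}


def standardize(scale):
    """Rescale one character to zero mean and unit spread over the 20 AAs."""
    values = [scale[aa] for aa in AMINO_ACIDS]
    mu, sd = mean(values), pstdev(values)
    return {aa: (scale[aa] - mu) / sd for aa in AMINO_ACIDS}


def type1_pseaac(sequence, scales, lam, w):
    """Sketch of a type I PseAA composition of dimension 20 + lam.

    sequence: protein sequence made of the 20 standard residues
    scales:   list of dicts, one per physicochemical character
    lam:      rank of correlation (0 <= lam < len(sequence))
    w:        weighting factor for the sequence order terms
    """
    if any(aa not in AMINO_ACIDS for aa in sequence):
        raise ValueError("sequence contains a non-standard residue")
    if not 0 <= lam < len(sequence):
        raise ValueError("lam must satisfy 0 <= lam < len(sequence)")

    std_scales = [standardize(s) for s in scales]
    length = len(sequence)

    # Conventional AA composition (normalized occurrence frequencies).
    freqs = [sequence.count(aa) / length for aa in AMINO_ACIDS]

    # Rank-k correlation factor: mean squared difference between residues
    # that are k positions apart, averaged over the selected characters.
    thetas = []
    for k in range(1, lam + 1):
        total = 0.0
        for i in range(length - k):
            a, b = sequence[i], sequence[i + k]
            total += mean((s[b] - s[a]) ** 2 for s in std_scales)
        thetas.append(total / (length - k))

    denom = sum(freqs) + w * sum(thetas)
    return [f / denom for f in freqs] + [w * t / denom for t in thetas]


example = type1_pseaac("ACDEFGHIKLMNPQRSTVWYAC", [TOY_SCALE], lam=3, w=0.05)
print(len(example))  # 23, i.e. 20 composition terms plus 3 correlation terms
```

With `lam=0` the function returns the conventional AA composition, which matches the degenerate case noted above.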
In the current study, all six of the aforementioned physicochemical characters were adopted to generate the PseAA composition, which is subsequently used as a descriptor to characterize protein sequences. This input vector is expected to encapsulate more sequence order and structural information of proteins, and its prediction quality with different parameters is discussed in this study. The value of each element of the PseAA composition was normalized to between 0 and 1 using the standard conversion formula before being input into the SVM prediction engine.

Support vector machines

The SVM [53] is a popular machine learning algorithm based on structural risk minimization for pattern classification. Its basic idea is described briefly as follows. Consider a series of training vectors $x_i \in R^d$ (i = 1, 2, ..., n) in two classes with corresponding labels $y_i \in \{+1, -1\}$ (i = 1, 2, ..., n), where +1 and –1 stand for the two respective classes. The goal is to construct a binary classifier or derive a decision function from the available training samples that has a small probability of misclassifying a future sample. The SVM performs a mapping of the training vectors $x_i$ (i = 1, 2, ..., n) from the input space $R^d$ into a higher dimensional space H by a kernel function $K(x_i, x_j)$ and finds an optimal separating hyperplane


(OSH), which maximizes the margin between the hyperplane and the nearest data points of each class in the space H. Different kernel functions define different SVMs. Typical kernel functions include the following:

$$K(x_i, x_j) = (x_i \cdot x_j + 1)^d \tag{1}$$

$$K(x_i, x_j) = \exp\left(-\gamma \lVert x_i - x_j \rVert^2\right). \tag{2}$$

Eq. (1) is the polynomial kernel function of degree d; when d = 1, it is the linear kernel function. Eq. (2) is the radial basis function (RBF) kernel, where $\gamma = 1/\sigma$ and $\sigma$ is called the width of the kernel. Training an SVM is equivalent to solving the following convex quadratic optimization problem:

$$\min_{\alpha} \; \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j) - \sum_{i=1}^{n} \alpha_i, \tag{3}$$

subject to

$$\sum_{i=1}^{n} y_i \alpha_i = 0 \tag{4}$$

and

$$0 \le \alpha_i \le C, \quad i = 1, 2, \ldots, n, \tag{5}$$

where n is the number of training samples, C is the regularization parameter used to control the trade-off between training error and the margin, and $\alpha_i$ (i = 1, 2, ..., n) are the coefficients. The decision function is

$$f(x) = \operatorname{sgn}\left(\sum_{i=1}^{n} y_i \alpha_i K(x_i, x) + b\right). \tag{6}$$

The software used to implement SVMs was LIBSVM, developed by Chang and Lin [54], which can be downloaded freely from http://www.csie.ntu.edu.tw/~cjlin/libsvm for academic purposes. The SVM classifiers were trained with the "one versus rest" method to handle the multiclass problem in this study. Empirical studies have shown that the RBF kernel outperforms the linear kernel and the polynomial kernel. Therefore, in this work SVMs with the RBF kernel were employed to construct the predictor.

Performance evaluation

In statistical prediction, the following three methods are often used to examine a predictor for its effectiveness in practical applications: the resubstitution test, the subsampling (5- or 10-fold cross-validation) test, and the jackknife test. A resubstitution test is an examination of the self-consistency of a prediction method. When the resubstitution test is performed for the current study, the subfamily of each protein in the dataset is in turn identified using the rule parameters derived from the same dataset, the so-called training dataset. In a subsampling test, the dataset of all proteins is divided into n (n = 5 or 10) subsets of approximately equal size. This means that the dataset is partitioned into training and testing data in n different ways. After the SVMs are trained with a collection of n – 1 subsets, their performance is tested against the remaining (left-out) subset. This process is repeated n times so that every subset is used once as the test data. In a jackknife test, each protein in the dataset is in turn singled out as a tested protein, and the SVMs are trained on the remaining proteins. As elucidated and demonstrated by Chou and Shen [55,56], among the three methods, the jackknife test is deemed to be the most objective one because it always yields a unique result for a given benchmark dataset [57]; hence, it has been increasingly used by investigators to examine the accuracy of various predictors [27,28,30,31,34,35,37,58,59].

To provide a broader overview and reference, the prediction performance based on all three methods is reported in this article. Furthermore, as a demonstration of practical application and as an unbiased test, the SVM-based method is further evaluated on an independent or blind dataset. Two measures are used to assess the prediction performance of the current method. The overall accuracy and the classification accuracy are defined following Hua and Sun [60]:

$$\text{overall accuracy} = \frac{\sum_{i=1}^{k} p(i)}{N} \tag{7}$$

$$\text{accuracy}(i) = \frac{p(i)}{\text{obs}(i)}, \tag{8}$$

where N is the total number of sequences in the dataset, k is the number of classes (subfamilies), obs(i) is the number of sequences observed in subfamily i, and p(i) is the number of correctly predicted sequences of subfamily i.

Results and discussion
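As an illustration of the evaluation protocol described under Performance evaluation, the following Python sketch implements the overall accuracy and the per-subfamily accuracy in the sense of Hua and Sun [60], together with index generators for n-fold cross-validation and the jackknife test. The function names are hypothetical, and the interleaved fold assignment is an assumption, since the article does not say how proteins were assigned to subsets (in practice the data would normally be shuffled first). The example labels use the blind dataset sizes from Table 1 with one missed Fushi tarazu-F1 protein; the class it is assigned to here is arbitrary.

```python
from collections import Counter


def overall_accuracy(true_labels, predicted_labels):
    """Fraction of all sequences whose class is predicted correctly."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


def class_accuracy(true_labels, predicted_labels):
    """Per-class accuracy: correctly predicted / observed, for each class."""
    observed = Counter(true_labels)
    correct = Counter(t for t, p in zip(true_labels, predicted_labels) if t == p)
    return {label: correct[label] / observed[label] for label in observed}


def k_fold_indices(n_samples, n_folds):
    """Yield (train, test) index lists; fold sizes differ by at most one."""
    folds = [list(range(f, n_samples, n_folds)) for f in range(n_folds)]
    for f in range(n_folds):
        train = [i for g in range(n_folds) if g != f for i in folds[g]]
        yield train, folds[f]


def jackknife_indices(n_samples):
    """Jackknife (leave-one-out): every sample is the test set exactly once."""
    return k_fold_indices(n_samples, n_samples)


# Blind dataset sizes from Table 1.
blind_sizes = {"Thyroid hormone": 17, "HNF4": 15, "Estrogen": 29, "Fushi tarazu-F1": 2}
true_labels = [name for name, n in blind_sizes.items() for _ in range(n)]
predicted_labels = list(true_labels)
# Hypothetical: one Fushi tarazu-F1 protein is missed; the wrong class
# chosen here is arbitrary because the article does not report it.
predicted_labels[-1] = "HNF4"

print(round(100 * overall_accuracy(true_labels, predicted_labels), 1))  # 98.4
```

A training loop would call `k_fold_indices(282, 5)` for the fivefold test on the training dataset, fit the classifier on each `train` list, and pool the predictions on the `test` lists before computing the two measures.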

To find a representative descriptor of proteins that can classify the subfamilies of nuclear receptors with high accuracy, we investigated the prediction performance of type I and type II PseAA compositions. The type I and type II PseAA compositions were generated by PseAAC with different parameters, and the PseAA composition resulting in the highest prediction accuracy was accepted to describe protein sequences. Various values of the parameter γ were tested by SVMs with the RBF kernel, and our choices were γ = 0.5 and γ = 0.05 for the prediction of nuclear receptors with type I and type II PseAA compositions, respectively. The parameter C was set at 500 in this study.

Results on type I PseAA composition

To consider the influence of the rank of correlation factor λ on the prediction quality, we first set the weighting factor w at 0.05. Different values of λ (see Table 2) were then used to generate PseAA compositions of different dimensions for representing proteins. The optimal value of λ is the one that results in the best overall cross-validated accuracy, and it may vary with different training datasets. Table 2 shows the results of the resubstitution test, jackknife test, and fivefold cross-validation test for the RBF kernel SVM classifiers with the parameters γ = 0.5 and C = 500 using the type I PseAA composition. From Table 2, we

Table 2
Overall accuracy (%) of type I PseAA composition with different values of λ (C = 500, γ = 0.5).

λ     Resubstitution   Jackknife   Fivefold cross-validation
5     100              92.199      91.135
10    100              92.908      92.553
15    100              93.972      91.844
20    100              93.972      93.617
25    100              94.681      95.745
30    100              94.326      94.681
35    100              95.390      95.390
40    100              94.681      95.390
45    100              95.036      95.036
50    100              96.454      95.390
55    100              96.454      96.454
60    100              97.163      96.454
65    100              96.809      96.454
70    100              96.809      96.099
75    100              96.454      95.745
80    100              95.390      94.681

Note. The highest accuracy is shown in bold.


can see that in the resubstitution test, the overall accuracy for the 282 proteins in the training dataset is 100%, indicating perfect self-consistency of the current predictor. High prediction accuracy is also observed for the other two performance evaluation methods. Table 2 indicates that the prediction performance is closely correlated with the value of λ. For the current study, the optimal value of λ is 60; that is, the dimension of the PseAA composition considered here is 80 (20 + 60). The highest overall accuracies of 97.2 and 96.5% were observed in the jackknife test and fivefold cross-validation test, respectively.

As a complement, we also examined the effect on the prediction quality of the weighting factor w, which scales the last λ components so that they are comparable with the first 20 AA composition components. During this procedure, λ is fixed at 60 while w is varied. The prediction results by the three testing methods are shown in Table 3. From this table, we notice that the weighting factor has a slight influence on the overall prediction performance. Under the current conditions, the best results are observed at w = 0.35. The overall accuracies are 100, 97.9, and 96.8% for the resubstitution test, jackknife test, and fivefold cross-validation test, respectively.

Results on type II PseAA composition

Similarly, we first set the weighting factor w at 0.05 to consider the relationship between the prediction performance and the rank of correlation factor λ. Table 4 shows the results of the current method based on type II PseAA composition using RBF kernel SVM classifiers with the parameters γ = 0.05 and C = 500. It is clear that extremely high accuracy is reached by the proposed method with λ = 25; that is, the dimension of the PseAA composition considered here is 170 [20 + (6 × 25)]. The overall accuracies are 100, 99.6, and 99.6% for the resubstitution test, jackknife test, and fivefold cross-validation test, respectively.
Table 3
Overall accuracy (%) of type I PseAA composition with different values of w.

w      Resubstitution   Jackknife   Fivefold cross-validation
0.05   100              97.163      96.454
0.10   100              97.163      96.454
0.15   100              97.518      96.454
0.20   100              97.518      96.809
0.25   100              97.518      96.809
0.30   100              97.518      96.809
0.35   100              97.872      96.809
0.40   100              97.518      96.809
0.45   100              97.518      96.809

Note. The highest accuracy is shown in bold.

Table 4
Overall accuracy (%) of type II PseAA composition with different values of λ (C = 500, γ = 0.05).

λ     Resubstitution   Jackknife   Fivefold cross-validation
5     100              93.617      93.262
10    100              96.809      96.099
15    100              98.936      97.163
20    100              99.645      99.291
25    100              99.645      99.645
30    100              99.645      98.936
35    100              99.645      98.936
40    100              99.291      98.936
45    100              99.291      98.936
50    100              98.582      98.582

Note. The highest accuracy is shown in bold.

The results indicate that type II PseAA composition is a better representation of protein

sequences than is type I PseAA composition for distinguishing the subfamilies of nuclear receptors. As for the effect of the weighting factor on the prediction quality of type II PseAA composition, the conclusion is the same as for type I PseAA composition. Table 5 shows the corresponding results. In this study, the best results are observed with w = 0.05. The overall accuracies are 100, 99.6, and 99.6% for the resubstitution test, jackknife test, and fivefold cross-validation test, respectively. Therefore, in this work type II PseAA composition generated with λ = 25 and w = 0.05 is recommended for predicting nuclear receptors.

Comparison with previous methods

To check the performance of our method, we made comparisons with a previous method, NRpred, developed by Bhasin and Raghava [8]. They also used SVMs and fivefold cross-validation to test the performance of their method on a dataset of 282 nuclear receptors in four subfamilies. Although the dataset is different, the number of proteins in each subfamily and the sequence similarity (≤ 90%) are the same. The prediction performance of the two methods based on the fivefold cross-validation test is shown in Table 6. From this table, we can see that our method achieved a superior performance. The overall accuracy is improved from 97.5 to 99.6%; in particular, the accuracy for Fushi tarazu-F1 is enhanced from 85.3 to 100%. This indicates that the current method is also very suitable for small groups. As for the thyroid hormone subfamily, the two methods cannot be distinguished by accuracy because both reach 100%, but the proposed method achieved this accuracy with fewer features (170 vs. 400). Therefore, from Table 6, we can conclude that the current method is appropriate for both small-size and large-size subfamilies. At the same time, we also employed the blind dataset built in this work to compare the performance of the two methods.
It is ideal to evaluate methods on a blind or independent dataset to demonstrate their true or unbiased performance, as elucidated by Bhasin and Raghava [61]. The sequences of the blind dataset were not used for either training or testing during the development of the method. The prediction results on the blind dataset are shown in Table 7. From this table, we can see that the numbers of correctly predicted proteins of the two methods are identical on the blind dataset. But we believe that with more testing samples, the prediction accuracy of the

Table 5
Overall accuracy (%) of type II PseAA composition with different values of w.

w      Resubstitution   Jackknife   Fivefold cross-validation
0.05   100              99.645      99.645
0.1    100              99.645      99.645
0.15   100              99.291      98.936
0.2    100              99.291      98.936
0.25   100              98.936      98.936
0.3    100              98.936      98.936

Note. The highest accuracy is shown in bold.

Table 6
Comparison with NRpred by fivefold cross-validation.

Subfamily          Total   Accuracy (%), NRpred   Accuracy (%), current method
Thyroid hormone    114     100                    100
HNF4               72      95.8                   98.6
Estrogen           75      98.7                   100
Fushi tarazu-F1    21      85.3                   100
Overall            282     97.5                   99.6

Note. The highest accuracy is shown in bold.


Table 7
Comparison with NRpred on the blind dataset.

Nuclear receptor class   Total   Correctly predicted, NRpred   Correctly predicted, current method
Thyroid hormone          17      17                            17
HNF4                     15      15                            15
Estrogen                 29      29                            29
Fushi tarazu-F1          2       1                             1
Overall                  63      62/63 (98.4%)                 62/63 (98.4%)

current method will be superior to that of NRpred, as was demonstrated in the rigorous cross-validation test. Importantly, the number of protein features used in our method is smaller than that used in NRpred; this can reduce the computational time of the prediction system, especially during large-scale genome prediction. In short, the comparisons demonstrate the applicability of the current method and a possible improvement in accuracy for the prediction of nuclear receptors.

Conclusion

In this article, we have introduced a promising method based on PseAA composition for predicting the subfamilies of nuclear receptors with SVMs. Six physicochemical characters of AAs are adopted to generate two types of sequence features via the web server PseAAC [49]. To obtain the best performance of the proposed method, different values of the rank of correlation factor and the weighting factor for PseAAC were tested. Finally, the optimal values of λ = 25 for the rank of correlation factor and w = 0.05 for the weighting factor were selected to generate the two types of PseAA composition. Overall accuracies of 95.7 and 99.6% were observed by fivefold cross-validation based on the resulting PseAA compositions of type I and type II, respectively. According to these results, we can conclude that PseAA composition of type II is a better representation of proteins for the prediction of nuclear receptors. Moreover, in the comparison with a previous method, the proposed method also exhibited a very competitive performance. In conclusion, a novel method for the prediction of nuclear receptors has been developed. It is anticipated that this method would play a complementary role in the prediction of nuclear receptors and other protein attributes such as subcellular localization, membrane types, and enzyme family and subfamily classes.

Acknowledgments

The authors thank the anonymous reviewers for their helpful comments that improved this article greatly.
This work was supported by the National Natural Science Foundation of China (grant 30671821) and the Natural Science Foundation of Shanghai (grant 08JC1405100).

References

[1] F. Horn, G. Vriend, F.E. Cohen, Collecting and harvesting biological data: the GPCRDB and NucleaRDB information systems, Nucleic Acids Res. 29 (2001) 346–349. [2] M. Robinson-Rechavi, H. Escriva Garcia, V. Laudet, The nuclear receptor superfamily, J. Cell Sci. 116 (2003) 585–586. [3] E. Martinez, D.D. Moore, E. Keller, D. Pearce, The nuclear receptor resource. A growing family, Nucleic Acids Res. 26 (1998) 239–241. [4] A.E. Sluder, A.E. Mathews, D. Hough, V.P. Yin, C.V. Maina, The nuclear receptor superfamily has undergone extensive proliferation and diversification in nematodes, Genome Res. 9 (1999) 103–120. [5] D.J. Mangelsdorf, C. Thummel, M. Beato, P. Herrlich, G. Schutz, K. Umesono, B. Blumberg, P. Kastner, M. Mark, P. Chambon, R.M. Evans, The nuclear receptor superfamily: the second decade, Cell 83 (1995) 835–839. [6] S.F. Altschul, W. Gish, W. Miller, E.W. Myers, D.J. Lipman, Basic local alignment search tool, J. Mol. Biol. 215 (1990) 403–410.

[7] W.R. Pearson, D.J. Lipman, Improved tools for biological sequence comparison, Proc. Natl. Acad. Sci. USA 85 (1988) 2444–2448.
[8] M. Bhasin, G.P.S. Raghava, Classification of nuclear receptors based on amino acid composition and dipeptide composition, J. Biol. Chem. 279 (2004) 23262–23266.
[9] K.C. Chou, Prediction of protein cellular attributes using pseudo amino acid composition, Proteins Struct. Funct. Genet. 43 (2001) 246–255 (Erratum, vol. 44, 2001, p. 60).
[10] Y.D. Cai, K.C. Chou, Nearest neighbor algorithm for predicting protein subcellular location by combining functional domain composition and pseudo-amino acid composition, Biochem. Biophys. Res. Commun. 305 (2003) 407–411.
[11] Y.D. Cai, K.C. Chou, Predicting subcellular localization of proteins in a hybridization space, Bioinformatics 20 (2004) 1151–1156.
[12] Y. Gao, S.H. Shao, X. Xiao, Y.S. Ding, Y.S. Huang, Z.D. Huang, K.C. Chou, Using pseudo amino acid composition to predict protein subcellular location: approached with Lyapunov index, Bessel function, and Chebyshev filter, Amino Acids 28 (2005) 373–376.
[13] X. Xiao, S. Shao, Y. Ding, Z. Huang, Y. Huang, K.C. Chou, Using complexity measure factor to predict protein subcellular location, Amino Acids 28 (2005) 57–61.
[14] X. Xiao, S. Shao, Y. Ding, Z. Huang, K.C. Chou, Using cellular automata images and pseudo amino acid composition to predict protein subcellular location, Amino Acids 30 (2006) 49–54.
[15] K.C. Chou, H.B. Shen, Euk-mPLoc: a fusion classifier for large-scale eukaryotic protein subcellular location prediction by incorporating multiple sites, J. Proteome Res. 6 (2007) 1728–1734.
[16] K.C. Chou, H.B. Shen, Large-scale plant protein subcellular location prediction, J. Cell. Biochem. 100 (2007) 665–678.
[17] H.B. Shen, K.C. Chou, Hum-mPLoc: an ensemble classifier for large-scale human protein subcellular location prediction by incorporating samples with multiple sites, Biochem. Biophys. Res. Commun. 355 (2007) 1006–1011.
[18] J.Y. Shi, S.W. Zhang, Q. Pan, Y.M. Cheng, J. Xie, Prediction of protein subcellular localization by support vector machines using multi-scale energy and pseudo amino acid composition, Amino Acids 33 (2007) 69–74.
[19] Y.L. Chen, Q.Z. Li, Prediction of apoptosis protein subcellular location using improved hybrid approach and pseudo amino acid composition, J. Theor. Biol. 248 (2007) 377–381.
[20] M. Wang, J. Yang, G.P. Liu, Z.J. Xu, K.C. Chou, Weighted-support vector machines for predicting membrane protein types based on pseudo-amino acid composition, Protein Eng. Des. Sel. 17 (2004) 509–516.
[21] H. Liu, M. Wang, K.C. Chou, Low-frequency Fourier spectrum for predicting membrane protein types, Biochem. Biophys. Res. Commun. 336 (2005) 737–739.
[22] H.B. Shen, K.C. Chou, Using optimized evidence-theoretic K-nearest neighbor classifier and pseudo amino acid composition to predict membrane protein types, Biochem. Biophys. Res. Commun. 334 (2005) 288–292.
[23] S.Q. Wang, J. Yang, K.C. Chou, Using stacked generalization to predict membrane protein types based on pseudo amino acid composition, J. Theor. Biol. 242 (2006) 941–946.
[24] Y.D. Cai, K.C. Chou, Predicting membrane protein type by functional domain composition and pseudo-amino acid composition, J. Theor. Biol. 238 (2006) 395–400.
[25] H.B. Shen, J. Yang, K.C. Chou, Fuzzy KNN for predicting membrane protein types from pseudo amino acid composition, J. Theor. Biol. 240 (2006) 9–13.
[26] K.C. Chou, H.B. Shen, MemType-2L: a web server for predicting membrane proteins and their types by incorporating evolution information through PsePSSM, Biochem. Biophys. Res. Commun. 360 (2007) 339–345.
[27] C. Chen, X.B. Zhou, Y.X. Tian, X.Y. Zou, P.X. Cai, Predicting protein structural class with pseudo-amino acid composition and support vector machine fusion network, Anal. Biochem. 357 (2006) 116–121.
[28] C. Chen, Y.X. Tian, X.Y. Zou, P.X. Cai, J.Y. Mo, Using pseudo amino acid composition and support vector machine to predict protein structural class, J. Theor. Biol. 243 (2006) 444–448.
[29] X. Xiao, S. Shao, Z. Huang, K.C. Chou, Using pseudo amino acid composition to predict protein structural classes: approached with complexity measure factor, J. Comput. Chem. 27 (2006) 478–482.
[30] Y.S. Ding, T.L. Zhang, K.C. Chou, Prediction of protein structure classes with pseudo amino acid composition and fuzzy support vector machine network, Protein Pept. Lett. 14 (2007) 811–815.
[31] H. Lin, Q.Z. Li, Using pseudo amino acid composition to predict protein structural class: approached by incorporating 400 dipeptide components, J. Comput. Chem. 28 (2007) 1463–1466.
[32] T.L. Zhang, Y.S. Ding, K.C. Chou, Predicting protein structural classes with pseudo-amino acid composition: approximate entropy and hydrophobicity pattern, J. Theor. Biol. 250 (2008) 186–193.
[33] H.B. Shen, K.C. Chou, Predicting protein subnuclear location with optimized evidence-theoretic K-nearest classifier and pseudo amino acid composition, Biochem. Biophys. Res. Commun. 337 (2005) 752–756.
[34] P. Mundra, M. Kumar, K.K. Kumar, V.K. Jayaraman, B.D. Kulkarni, Using pseudo amino acid composition to predict protein subnuclear localization: approached with PSSM, Pattern Recognit. Lett. 28 (2007) 1610–1615.
[35] P. Du, Y. Li, Prediction of protein submitochondria locations by hybridizing pseudo-amino acid composition with various physicochemical features of segmented sequence, BMC Bioinform. 7 (2006) 518.
[36] S. Mondal, R. Bhavna, R. Mohan Babu, S. Ramakumar, Pseudo amino acid composition and multi-class support vector machines approach for conotoxin superfamily classification, J. Theor. Biol. 243 (2006) 252–260.

[37] H. Lin, Q.Z. Li, Predicting conotoxin superfamily and family by using pseudo amino acid composition and modified Mahalanobis discriminant, Biochem. Biophys. Res. Commun. 354 (2007) 548–551.
[38] K.C. Chou, Y.D. Cai, Predicting protein quaternary structure by pseudo amino acid composition, Proteins Struct. Funct. Genet. 53 (2003) 282–289.
[39] S.W. Zhang, Q. Pan, H.C. Zhang, Z.C. Shao, J.Y. Shi, Prediction of protein homooligomer types by pseudo amino acid composition: approached with an improved feature extraction and naive Bayes feature fusion, Amino Acids 30 (2006) 461–468.
[40] K.C. Chou, Y.D. Cai, Predicting enzyme family class in a hybridization space, Protein Sci. 13 (2004) 2857–2863.
[41] Y.D. Cai, K.C. Chou, Predicting enzyme subclass by functional domain composition and pseudo amino acid composition, J. Proteome Res. 4 (2005) 967–971.
[42] K.C. Chou, Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes, Bioinformatics 21 (2005) 10–19.
[43] H.B. Shen, K.C. Chou, EzyPred: a top-down approach for predicting enzyme functional classes and subclasses, Biochem. Biophys. Res. Commun. 364 (2007) 53–59.
[44] X.B. Zhou, C. Chen, Z.C. Li, X.Y. Zou, Using Chou’s amphiphilic pseudo-amino acid composition and support vector machine for prediction of enzyme subfamily classes, J. Theor. Biol. 248 (2007) 546–551.
[45] K.C. Chou, Y.D. Cai, Prediction of protease types in a hybridization space, Biochem. Biophys. Res. Commun. 339 (2006) 1015–1020.
[46] G.P. Zhou, Y.D. Cai, Predicting protease types by hybridizing gene ontology and pseudo amino acid composition, Proteins: Struct. Funct. Bioinform. 63 (2006) 681–684.
[47] H. Lin, The modified Mahalanobis discriminant for predicting outer membrane proteins by using Chou’s pseudo amino acid composition, J. Theor. Biol. 252 (2008) 350–356.
[48] G.Y. Zhang, B.S. Fang, Predicting the cofactors of oxidoreductases based on amino acid composition distribution and Chou’s amphiphilic pseudo amino acid composition, J. Theor. Biol. 253 (2008) 310–315.

[49] H.B. Shen, K.C. Chou, PseAAC: a flexible web server for generating various kinds of protein pseudo amino acid composition, Anal. Biochem. 373 (2008) 386–388.
[50] W. Li, L. Jaroszewski, A. Godzik, Clustering of highly homologous sequences to reduce the size of large protein databases, Bioinformatics 17 (2001) 282–283.
[51] W. Li, L. Jaroszewski, A. Godzik, Tolerating some redundancy significantly speeds up clustering of large protein databases, Bioinformatics 18 (2002) 77–82.
[52] W. Li, A. Godzik, CD-HIT: a fast program for clustering and comparing large sets of protein or nucleotide sequences, Bioinformatics 22 (2006) 1658–1659.
[53] V. Vapnik, Statistical Learning Theory, John Wiley, New York, 1998.
[54] C.C. Chang, C.J. Lin, LIBSVM: a library for support vector machines, 2001, http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[55] K.C. Chou, H.B. Shen, Cell-PLoc: a package of web-servers for predicting subcellular localization of proteins in various organisms, Nat. Protoc. 3 (2008) 153–162.
[56] K.C. Chou, H.B. Shen, Review: recent progresses in protein subcellular location prediction, Anal. Biochem. 370 (2007) 1–16.
[57] K.C. Chou, C.T. Zhang, Review: prediction of protein structural classes, Crit. Rev. Biochem. Mol. Biol. 30 (1995) 275–349.
[58] Q.B. Gao, Z.Z. Wang, Classification of G-protein coupled receptors at four levels, Protein Eng. Des. Sel. 19 (2006) 511–516.
[59] Y. Fang, Y. Guo, Y. Feng, M. Li, Predicting DNA-binding proteins: approached from Chou’s pseudo amino acid composition and other specific sequence features, Amino Acids 34 (2008) 103–109.
[60] S. Hua, Z. Sun, Support vector machine approach for protein subcellular localization prediction, Bioinformatics 17 (2001) 721–728.
[61] M. Bhasin, G.P.S. Raghava, GPCRpred: an SVM-based method for prediction of families and subfamilies of G-protein coupled receptors, Nucleic Acids Res. 32 (2004) W383–W389.