Automatic transcription factor classifier based on functional domain composition

BBRC Biochemical and Biophysical Research Communications 347 (2006) 141–144 www.elsevier.com/locate/ybbrc Automatic transcription factor classiﬁer ba...

Download PDF

513KB Sizes 0 Downloads 32 Views

Report

PDF Reader
Full Text

BBRC Biochemical and Biophysical Research Communications 347 (2006) 141–144 www.elsevier.com/locate/ybbrc

Automatic transcription factor classiﬁer based on functional domain composition Ziliang Qian

a,b

, Yu-Dong Cai

c,d,*

, Yixue Li

a,e,*

a

Bioinformatics Center, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, 320 Yueyang Road, Shanghai 200031, China b Graduate School of the Chinese Academy of Sciences, 19 Yuquan Road, Beijing 100039, China c CAS-MPG Partner Institute for Computational Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, China d Biomolecular Sciences Department, University of Manchester, Institute of Science and Technology, P.O. Box 88, Manchester M60 1QD, UK e Shanghai Center for Bioinformation Technology, 100 Qinzhou Road, 200235 Shanghai, China Received 7 June 2006 Available online 21 June 2006

Abstract To understand the transcriptional regulatory mechanism, it is indispensable to identify transcription factors (TF) from the whole genome and to classify transcription factors into diﬀerent classes. New computational approaches have been developed to identify TFs/nonTFs, and furthermore to classify TFs into four diﬀerent classes, based on the protein functional domain composition [K.C. Chou, Y.D. Cai, Using functional domain composition and support vector machines for prediction of protein subcellular location, J. Biol. Chem. 277 (2002) 45765–45769]. We trained and tested our method on a non-redundancy dataset consisting of 74 transcription factors collected from TRANSFAC v7.0 [V. Matys, O.V. Kel-Margoulis, E. Fricke, I. Liebich, S. Land, A. Barre-Dirrie, I. Reuter, D. Chekmenev, M. Krull, K. Hornischer, N. Voss, P. Stegmaier, B. Lewicki-Potapov, H. Saxel, A.E. Kel, E. Wingender, TRANSFAC(R) and its module TRANSCompel(R): transcriptional gene regulation in eukaryotes, Nucleic Acids Res. 34 (2006) D108–D110] and 1558 non-transcription factors from UniProtKB/Swiss-Prot Release 49.3 of 21-Mar-2006. The overall success rates of jackknife cross-validation tests reached 98.4% for TF/non-TF identiﬁcation and 97.2% for classiﬁcations of TF classes: basic domains, zinc-coordinating DNA-binding domains, helix-turn-helix, and b-scaﬀold factors. 2006 Elsevier Inc. All rights reserved. Keywords: Transcription factors; Functional domain composition; Intimate sorting classiﬁer; Jackknife cross-validation test

Transcription factor (TF) is often termed as the major regulator of transcription. Most of them (or dimers) bind to speciﬁc DNA fragments using their DNA-binding domain and modulate nearby genes’ transcription through their trans-activating/repressing domains. Generally, transcription factors can be classiﬁed into four major classes [2–4] (cf. Fig. 1): (1) Basic domains. (2) Zinc-coordinating DNA-binding domains. (3) Helix-turn-helix. (4) b-Scaﬀold factors with Minor Groove Contacts. Given a newly identiﬁed protein with poor prior knowledge, the following two questions are often raised: (1) Is it a transcription factor?

*

Corresponding authors. E-mail addresses: [email protected] (YD Cai), [email protected] (YX Li).

0006-291X/$ - see front matter 2006 Elsevier Inc. All rights reserved. doi:10.1016/j.bbrc.2006.06.060

(2) Which class it belongs to? Both are very important to understand its transcriptional regulatory function. Previous works on TF identiﬁcation/classiﬁcation were remarkably based on manual annotations [4], e.g., experimental data on transcription regulatory activity, protein structure and whole sequence homology, etc. However, collecting protein annotations manually is a time-consuming task. To overcome this problem, in this research we developed an automatic method to discriminate TFs from nonTFs and furthermore to classify TFs into four categories mentioned above based on protein functional domain composition, which has already been successfully used to predict protein–protein interaction [5], protein structure [6], protein function [7–9], protein subcellular location [1], etc. The classiﬁer we built got a fairly good performance

142

ZL Qian et al. / Biochemical and Biophysical Research Communications 347 (2006) 141–144

Fig. 1. Transcription factor classiﬁcation. Transcription factors are generally classiﬁed into four distinct classes. Left-up, zinc-coordinating DNA-binding domains. Right-up, basic domains. Left-bottom, helix-turn-helix. Right-bottom, b-scaﬀold factors. 3D structures of protein–DNA complexes were adapted from [18]. Proportion was calculated based on TRANSFAC(v7.0) [2].

with overall success rates of 98.4%, 97.2% for TF/non-TF identiﬁcation and TF classiﬁcation, respectively. Materials and methods TF/non-TF datasets. First, for transcription factors (TFs) with classiﬁcation information, the dataset came from TRANSFAC v7.0 [2], and for non-TF, the dataset was randomly selected from UniProtKB/Swiss-Prot Release 49.3 of 21-Mar-2006 by using keyword ‘‘membrane,’’ ‘‘secretory,’’ ‘‘antigen,’’ ‘‘transferase,’’ ‘‘kinase.’’ All together, a dataset with 1176 transcription factors and 29,295 non-transcription factors was built. And then we reﬁned this dataset as follows: (1) Filter out proteins with a length over 5000 aa or less than 50 aa and those without SwissProt accession number. (2) Remove the redundancy against homology bias using the programs cd-hit [10,11] and PISCES [12]. As a consequence, none of the sequences investigated have more than 25% sequence identity. Finally, a positive dataset with 84 TFs with known classiﬁcation information and a negative dataset with 2167 non-transcription factor proteins were obtained (Table 1). Functional domain composition feature vector. To facilitate a feasible statistical classiﬁer, each transcription factor (TF) must be expressed in Table 1 The original dataset Dataset Transcription factors (TF)

Size Basic domain Zinc-coordinate Helix-turn-helix b-Scaﬀold Overall

21 15 36 12 84

Non-TF

2167

Total

2251

terms of a set of discrete numbers instead of whole amino-acid sequence to catch the core features intimately related to biological functions. Because TFs are classiﬁed according to their structures and functions, it is anticipated that the prediction quality will be enhanced if we can ﬁnd a feasible approach to use the knowledge of structural and functional domains to deﬁne a transcription factor sample, such as DNA-binding domain(s), oligomerization domain(s), and trans-activating domain. This can be realized through the integrated domain and motif database, or the interPro databases at [http://www.ebi.ac.uk/interpro] through the following steps. Step 1. Extract domains information of a protein from InterPro by using the Protein2ipr mapping provided. For our TF/nonTF dataset, Protein2ipr release 12.0 on Friday November 18th 2005 [ftp://ftp.ebi.ac.uk/pub/databases/interpro/] was used. The result totally covered 8151 InterPro entries with well-known structural and functional domain types. Step 2. With each of the 8151 functional domain patterns as a vectorbase, the sample of a TF can be represented in a 8151D (dimensional) vector as: If there is a hit, e.g., transcription factor P49716 contains IPR004827 which is the 1970th record of the 8151 domains, then the 1970th component of the transcription factor P49716 in the 8151D feature space is set to 1; otherwise 0. Step 3. Then feature vector T for a given TF can thus be explicitly formulated as 0 t 1 1

B B B B B T ¼B B B B B @

t2 .. . ti .. . t8151

where,

C C C C C C; C C C C A

ð1Þ

ZL Qian et al. / Biochemical and Biophysical Research Communications 347 (2006) 141–144 ti ¼

1;

hit found;

0;

otherwise:

ð2Þ

Deﬁned in this way, each transcription factor will correspond to a 8151D vector T with each of the 8151 functional domains as the base for the vector space. In other words, rather than the amino acid composition approach or pseudo-amino acid composition as often used by previous investigators [7,13,14], a TF is now represented in terms of the functional domain composition. By doing so, not only some sequence-related features but also some function-related features are naturally incorporated in the representation. Intimate sorting (ISort) classiﬁer. The prediction was performed with the ISort classiﬁer, which can be brieﬂy described as follows. Suppose there are N transcription factors (TFs) T1, T2, . . . ,Tn, . . . ,TN which have already been classiﬁed into categories 1, 2, . . . ,c(n), . . . ,l, c(n) is the category of Tn. Now, for a query TF T, how can we predict which class it belong to? To deal with this problem, let us deﬁne the following scale to measure the similarity between T and Ti (i = 1, 2, . . . ,N) KðT ; T i Þ ¼

T Ti ; kT k kT i k

ð3Þ

where TÆTi is the dot product of T and Ti, and iTi and iTii are their moduli, respectively. Obviously, when T ” Ti, we have K(T, Ti) = 1, meaning they have perfect 100% similarity. Generally speaking, the similarity is within the range of 0 and 1, 0 6 K(T, Ti) 6 1. Accordingly, the ISort predictor can be formulated as follows. If the similarity between T and Tk (k = 1, 2, . . . ,N) is the highest, i.e., KðT ; T k Þ ¼ maxfKðT ; T 1 Þ; KðT ; T 2 Þ; . . . ; KðT ; T N Þg;

ð4Þ

where the operator max means taking the maximum one among those in the brackets, then the transcription factor T is predicted belonging to the same category as of Tk. If there is a tie, i.e., KðT ; T k1 Þ ¼ KðT ; T k2 Þ ¼ maxfKðT ; T 1 Þ; KðT ; T 2 Þ; . . . ; KðT ; T N Þg;

143

implementations, Jackknife cross-validation tests were operated as follows: For identifying TF/non-TF, for each protein T in the dataset consisting of 74 TFs and 1558 non-TFs, we applied the ﬁrst ISort classiﬁer to predict T ’s property (TF/non-TF) using the rest proteins excluding T. Classiﬁer succeeded if it correctly predicted the property of T. Then the success rate for TF/non-TF identiﬁcation was given according to the following formulas:

8 Correctly predicted TF > : > True TF < Success rate for TF ¼ Correctly predicted non-TF Success rate for non-TF ¼ : True non-TF > > : Success rate for overall ¼ Correctly predicted : Total

For classifying TFs into four diﬀerent classes, for each protein T in the dataset consisting of 74 TFs, we applied the second ISort classiﬁer to predict T ’s classiﬁcation using the rest proteins excluding T. Classiﬁer succeeded if it correctly predicted the classiﬁcation of T. Then the success rate for TF classiﬁcation was given according to the following formulas: 8 Success > > > > > > > < Success Success > > > > Success > > > : Success

predicted “basic domain” rate for “basic domain” ¼ Correctly : True “basic domain” predicted “zinc-coordinating” rate for “zinc-coordinating” ¼ CorrectlyTrue : “zinc-coordinate” predicted “helix-turn-helix” rate for “helix-turn-helix” ¼ Correctly : True “helix-turn-helix” predicted “beta-scaffold” rate for “beta-scaffold” ¼ Correctly : True “beta-scaffold” predicted rate for overall ¼ Correctly : Total

ð5Þ

the query transcription factor T cannot be uniquely determined. In these rare cases, T will be randomly classiﬁed into category as of either T k1 or T k2 .

Results and discussion Two 8151D ISort classiﬁers were built, one for identifying TF/non-TFs and another for further classifying TFs into four diﬀerent categories: basic domains, zinccoordinating DNA-binding domains, helix-turn-helix, and b-scaﬀold factors. According to step 1, step 2, and step 3 mentioned above, we obtain the following results. (1) For TF/non-TF identiﬁcation, with the exclusion of proteins that have no functional domain annotation and except orphans that have domains occurring only once in our original dataset, 8151D feature vectors were built for 74 TFs and 1558 non-TFs (cf. Table 2, Supplement ﬁle). (2) For TF classiﬁcation, three more TFs were ﬁltered because of orphans, thus 8151D feature vectors were built for 71 TFs (cf. Table 3, Supplement ﬁle). Jackknife cross-validation test was adopted to examine the performance of our predictor. In statistical prediction, the single independent dataset test, sub-sampling test, and the Jackknife test are the three cross validation approaches often used to examine the power of a predictor. Of these three, the Jackknife test is deemed as the most objective and rigorous one and hence adopted by more and more investigators [15–17]. In our

ð6Þ

ð7Þ

Tables 2 and 3 give the success rates of Jackknife crossvalidation test for TF/non-TF identiﬁcation and TF Table 2 The performances of TF/non-TF identiﬁcation Category

Jackknife test success rate

TF Non-TF

66/74 = 89.2% 1540/1558 = 98.8%

Overall

1606/1632 = 98.4%

Jackknife test successful rates in identifying TF/non-TFs. 74 out of 84 transcription factors (TF) and 1558 out of 2167 non-TF left after removing proteins which have no functional domain and removing orphans which have domains that occurred only once in our dataset.

Table 3 The performances of TF classiﬁcation Classiﬁcation

Jackknife test success rate

Basic domain Zinc-coordinate Helix-turn-helix b-Scaﬀold

20/20 = 100% 10/11 = 90.9% 33/33 = 100% 6/7 = 85.7%

Overall

69/71 = 97.2%

Jackknife test successful rates in classifying TFs into four diﬀerent classes. 71 out of 84 transcription factors left after removing proteins which have no functional domain and removing orphans which have domains that occurred only once in our dataset.

144

ZL Qian et al. / Biochemical and Biophysical Research Communications 347 (2006) 141–144

classiﬁcation, respectively. Our predictors got a very good performance. As shown in Table 2, the success rates were 89.2%, 98.8% for TF and non-TF identiﬁcation, respectively, and 98.4% overall. As shown in Table 3, the success rates reached 100%, 90.9%, 100% and 85.7% for basic domain TFs, zinc-coordinating TFs, helix-turn-helix TFs, and b-scaﬀold TFs, respectively, meanwhile 97.2% overall. These cheerful results demonstrate that domain composition is a very eﬀective means to characterize the features of TF for classiﬁcation. The computation was performed in a Silicon Graphics IRIS Indigo workstation (Elan 4000). Conclusion To understand transcription regulation mechanism, there is a great demand to identify TFs and further classify TFs into functional categories. Therefore, we built two automatic classiﬁers, respectively, in this contribution. We got a fairly good result, 98.4% for TF/non-TF identiﬁcation and 97.2% for four TF classiﬁcations, which means that our classiﬁers can be used as a good support in TF annotations as the amount of protein sequences increases rapidly in the post-genomic era. Acknowledgments This work was funded by National ‘‘973’’ Basic Research Program (2004CB518606, 2003CB715900, and 2001CB510209) and National ‘‘863’’ High-Tech R&D Program (2004BA711A21). Appendix A. Supplementary data Supplementary data associated with this article can be found, in the online version, at doi:10.1016/j.bbrc. 2006.06.060. References [1] K.C. Chou, Y.D. Cai, Using functional domain composition and support vector machines for prediction of protein subcellular location, J. Biol. Chem. 277 (2002) 45765–45769.

[2] V. Matys, O.V. Kel-Margoulis, E. Fricke, I. Liebich, S. Land, A. Barre-Dirrie, I. Reuter, D. Chekmenev, M. Krull, K. Hornischer, N. Voss, P. Stegmaier, B. Lewicki-Potapov, H. Saxel, A.E. Kel, E. Wingender, TRANSFAC(R) and its module TRANSCompel(R): transcriptional gene regulation in eukaryotes, Nucleic Acids Res. 34 (2006) D108–D110. [3] V. Matys, E. Fricke, R. Geﬀers, E. Gossling, M. Haubrock, R. Hehl, K. Hornischer, D. Karas, A.E. Kel, O.V. Kel-Margoulis, D.U. Kloos, S. Land, B. Lewicki-Potapov, H. Michael, R. Munch, I. Reuter, S. Rotert, H. Saxel, M. Scheer, S. Thiele, E. Wingender, TRANSFAC(R): transcriptional regulation, from patterns to proﬁles, Nucleic Acids Res. 31 (2003) 374–378. [4] E. Wingender, Classiﬁcation scheme of eukaryotic transcription factors Molekularnaya, Biologiya 31 (1997) 483–497. [5] J. Wojcik, V. Schachter, Protein–protein interaction map inference using interacting domain proﬁle pairs, Bioinformatics 17 (2001) S296– S305. [6] K.C. Chou, Y.D. Cai, Predicting protein structural class by functional domain composition, Biochem. Biophys. Res. Commun. 321 (2004) 1007–1009. [7] Y.D. Cai, K.C. Chou, Predicting enzyme subclass by functional domain composition and pseudo amino acid composition, J. Proteome Res. 4 (2005) 967–971. [8] Y.D. Cai, A.J. Doig, Prediction of Saccharomyces cerevisiae protein functional class from functional domain composition, Bioinformatics 20 (2004) 1292–1300. [9] X.J. Yu, J.C. Lin, T.L. Shi, Y.X. Li, A novel domain-based method for predicting the functional classes of proteins, Chinese Sci. Bull. 49 (2004) 2379–2384. [10] W. Li, L. Jaroszewski, A. Godzik, Tolerating some redundancy signiﬁcantly speeds up clustering of large protein databases, Bioinformatics 18 (2002) 77–82. [11] W. Li, L. Jaroszewski, A. Godzik, Clustering of highly homologous sequences to reduce the size of large protein databases, Bioinformatics 17 (2001) 282–283. [12] G. Wang, R.L. Dunbrack Jr., PISCES: a protein sequence culling server, Bioinformatics 19 (2003) 1589–1591. [13] G.P. Zhou, An intriguing controversy over protein structural class prediction, J. Protein Chem. 17 (1998) 729–738. [14] K.C. Chou, A novel approach to predicting protein structural classes in a (20-1)-D amino acid composition space, Proteins: Struct. Funct. Genet. 21 (1995) 319–344. [15] Y.D. Cai, K.C. Chou, Predicting membrane protein type by functional domain composition and pseudo-amino acid composition, J. Theor. Biol. 238 (2006) 395–400. [16] Y.D. Cai, K.C. Chou, Predicting membrane protein type by functional domain composition and pseudo-amino acid composition, J. Theor. Biol. 238 (2006) 395–400. [17] X. Yu, C. Wang, Y. Li, Classiﬁcation of protein quaternary structure by functional domain composition BMC, Bioinformatics 7 (2006) 187. [18] N.M. Luscombe, S.E. Austin, M. Helen, J.M. Thornton, An overview of the structures of protein-DNA complexes, Genome Biol. 1 (2000), reviews 001.

Automatic transcription factor classifier based on functional domain composition

Automatic transcription factor classifier based on functional domain composition

Recommend Documents