Tuberculosis 91 (2011) 556–562
Contents lists available at ScienceDirect
Tuberculosis journal homepage: http://intl.elsevierhealth.com/journals/tube
MOLECULAR ASPECTS
MtbSD: A comprehensive structural database for Mycobacterium tuberculosis
Sameer Hassan a, P. Logambiga a, A. Mohan Raman a, T.K. Subazini b, V. Kumaraswami a, Luke Elizabeth Hanna a,*
a National Institute for Research in Tuberculosis, Chetpet, Chennai 600 031, India
b AU-KBC Research Centre, Madras Institute of Technology, Chennai 600 044, India
Article info
Article history: Received 26 April 2011; received in revised form 6 August 2011; accepted 8 August 2011
Keywords: Mycobacterium tuberculosis; Proteins; Ligands; Domains; Database
Summary
The Mycobacterium tuberculosis Structural Database (MtbSD) (http://bmi.icmr.org.in/mtbsd/MtbSD.php) is a relational database for the study of protein structures of M. tuberculosis. It currently holds information on description, reaction catalyzed and domains involved, active sites, structural homologues and similarities between bound and cognate ligands, for all the 857 protein structures that are available for M. tuberculosis proteins. The database will be a valuable resource for TB researchers to select the appropriate protein-ligand complex of a given protein for molecular modelling, docking, virtual screening and structure-based drug designing. © 2011 Elsevier Ltd. All rights reserved.
1. Introduction
Mycobacterium tuberculosis, the etiological agent of tuberculosis (TB), causes approximately 8–10 million new infections and 3 million deaths worldwide every year.1 In 1993, the World Health Organisation (WHO) declared tuberculosis a global emergency. TB is one of the leading causes of mortality in India, killing 2 persons every 3 min, nearly 1000 every day.2 The complete genome of M. tuberculosis, comprising 4,411,529 bp and around 4000 genes, was sequenced in 1998.3 The availability of the mycobacterial genome sequence and advances in structural genomics have set up a platform for answering questions about how the organism functions as an integrated system and how it acts in conjunction with the host. Three-dimensional structures of proteins are important for understanding their biological function as well as their interaction with ligands. Structure determination for hypothetical proteins could help identify the biological function of a protein based on clues obtained from proteins with even distant structural homology and no apparent sequence identity.4,5 The Protein Data Bank (PDB) is a valuable resource of structural data for proteins of all eukaryotes and prokaryotes. TBSGC also hosts
* Corresponding author. Scientist C, Dept. of Clinical Research, National Institute for Research in Tuberculosis, Chetpet, Chennai 600031, India. Tel.: +91 44 28369575; fax: +91 44 28362528. E-mail address: [email protected] (L.E. Hanna).
1472-9792/$ - see front matter © 2011 Elsevier Ltd. All rights reserved. doi:10.1016/j.tube.2011.08.003
similar information but is specific for M. tuberculosis. Currently, both databases contain 853 protein structures for 328 gene products of M. tuberculosis; many proteins have multiple solved structures, viz. mutant forms, structures of individual or multiple domains, and complexes with different ligands. Since proteins are flexible molecules capable of changing their structure in response to external stimuli, binding of ligands may induce structural changes,6 some of which may be important for protein function. Exploring the changes that ligand binding produces in a protein structure will therefore help in understanding the function of the protein. When ligand-bound protein conformations are not available, structure-based drug design becomes highly challenging. Several studies have shown that virtual screening with an apo structure usually results in a poorer enrichment factor than screening with a holo structure, even when the structural difference between the two is small.7–9 X-ray or NMR structures of a protein/enzyme are not always in complex with its natural substrate or product. Binding of ligands other than cognate ligands could also bring about a range of structural deviations in the protein, and these changes can result in a poor enrichment factor. Identifying the appropriate protein-ligand complex based on structural similarity between the bound and cognate ligands is therefore essential for modelling and docking, as well as for drug designing studies. Further, in order to understand the catalytic activity of a target protein, availability of its crystal structure in combination with its
S. Hassan et al. / Tuberculosis 91 (2011) 556–562
appropriate ligand is essential. At present, information on catalytic activity and on similarity between cognate and bound ligands is available in the SwissProt10 and PROCOGNATE11 databases, respectively. Though SwissProt and other databases such as TBSGC,12 Tuberculist13 and TBDB14 contain information on M. tuberculosis protein structures, none of these databases provides a systematic grouping of structures for each protein, or highlights the differences between structures in the context of domain coverage, bound ligands etc. In this scenario, users have to visit multiple databases and employ many tools to acquire complete information on a given protein for protein modelling, docking and structure-based drug-designing studies. Keeping in mind the need for a complete structural database for M. tuberculosis proteins that would incorporate all the above-mentioned features, we developed a database called MtbSD (http://bmi.icmr.org.in/mtbsd/MtbSD.php).
2. Materials and methods
2.1. MtbSD platform
MySQL server version 5.1.41 was used in MtbSD to store, retrieve and manage data. All the scripts for data querying and retrieval were written in PHP, a widely used scripting language. The web interfaces were designed using HTML, with Cascading Style Sheets (CSS) for consistent styling.
2.2. Database generation
The sequences of all M. tuberculosis proteins were retrieved from the NCBI database and searched against PDB15 using the BLAST16 tool. Information about the different protein structures, bound ligands, and functional and catalytic details for each of the proteins was retrieved and carefully tabulated.
Individual protein structures were analyzed and categorized as:
a) Holo and apo structures, based on the presence or absence of a bound ligand
b) Structures with single and multiple domains
c) Structures of protein complexes, further categorized as protein-protein, protein-DNA, protein-drug and protein-peptide complexes
d) Proteins with structural motifs such as Zn finger, helix-turn-helix, P-loop, Greek key and Walker A and B motifs
e) Proteins expressed during dormancy17
f) Proteins grouped under COG18 functional categories, and essential proteins required for the survival and growth of M. tuberculosis
g) Structures with unknown function, such as those of hypothetical proteins
h) Structures of known drug targets
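Category (a) above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the `Structure` record and the `classify` helper are hypothetical names, and the rule simply treats any listed ligand as bound. A real curation step would presumably also have to decide how to treat solvent molecules and ions, which the article does not describe. The two example entries use the InhA structures named later in the text (2IED as apo, 1ENY bound to NAD).

```python
from dataclasses import dataclass, field


@dataclass
class Structure:
    """Hypothetical record for one solved structure (not the MtbSD schema)."""
    pdb_id: str
    ligands: list[str] = field(default_factory=list)  # bound ligand identifiers


def classify(structure: Structure) -> str:
    """Return 'holo' if at least one ligand is listed, otherwise 'apo'."""
    return "holo" if structure.ligands else "apo"


structures = [
    Structure("2IED"),           # described in the article as the apo InhA structure
    Structure("1ENY", ["NAD"]),  # described in the article as bound to NAD alone
]

# Map each PDB id to its category.
labels = {s.pdb_id: classify(s) for s in structures}
```

Keeping the ligand list on the record, rather than storing only the apo/holo label, leaves room for the later comparison of bound and cognate ligands.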
2.4. Domain mapping
The amino acid sequences of all M. tuberculosis proteins were retrieved from SwissProt and aligned with the sequences of the different structures available in PDB for each protein. The alignments were generated using Accelrys Discovery Studio v2.0. PFAM21 domains were mapped onto the aligned sequences. Residues that are identical, conserved and semi-conserved are colour coded for easier understanding.
2.5. Finding of structural homologues
For each protein structure having a SCOP22 (Structural Classification Of Proteins) classification, the structural homologues of each SCOP domain within the M. tuberculosis proteome were identified by structural superimposition using the SSM server. Protein structures that do not have a SCOP classification were searched as entire structures against all the available M. tuberculosis protein structures.
3. Results and discussion
3.1. MtbSD home page
The MtbSD home page provides a user-friendly interface and presents important highlights of the M. tuberculosis proteome. In the menu on the left side of the home page, users can find several links that provide information about apoproteins, holoproteins, protein complexes, proteins with structural motifs etc. (Figure 1). Users can also query the database using a simple search option provided on the home page. In addition, the home page provides the total number of genes having structures (gene index), a form for submitting newly solved structures, and contact details (for mailing suggestions, criticisms and possible errors).
3.2. Search page
The MtbSD home page provides a simple “search option” in the menu on the left side and an “advance search” option at the top of the page. In the simple search option, entering the Rv number, PDB id or gene name brings the user directly to the protein information page containing all information on the queried protein. This is a simple and fast way to search the database.
With the help of the advanced search option, users can filter their search by selecting different fields, which gives them access to the specific data that fulfil the selection criteria. The complete database can be searched using keywords, protein or gene name, accession number, functional categories, etc. (Figure 2). The criteria submitted through the HTML form are used to query the MtbSD database, and the results are returned in a summary table. Clicking the Rv number or MtbSD id in the summary table displays detailed information on the protein.
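The simple search described in this section can be sketched as a single relational query. The sketch below is only an illustration under stated assumptions: MtbSD itself runs on MySQL with PHP scripts, whereas this example uses Python's built-in sqlite3 module, and the table names, column names and the `simple_search` function are hypothetical, since the article does not publish its schema. The locus tag Rv1484 for inhA is added here for illustration and is not stated in the article.

```python
import sqlite3

# In-memory stand-in for the relational store (assumed layout, not the MtbSD schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE protein (rv_number TEXT PRIMARY KEY, gene_name TEXT);
CREATE TABLE structure (pdb_id TEXT PRIMARY KEY,
                        rv_number TEXT REFERENCES protein(rv_number));
""")
conn.execute("INSERT INTO protein VALUES ('Rv1484', 'inhA')")
conn.executemany("INSERT INTO structure VALUES (?, ?)",
                 [("2IED", "Rv1484"), ("1ENY", "Rv1484")])


def simple_search(term: str) -> list[tuple[str, str, str]]:
    """Match one term against Rv number, gene name or PDB id, ignoring case."""
    query = """
        SELECT p.rv_number, p.gene_name, s.pdb_id
        FROM protein AS p
        JOIN structure AS s ON s.rv_number = p.rv_number
        WHERE p.rv_number = :t COLLATE NOCASE
           OR p.gene_name = :t COLLATE NOCASE
           OR s.pdb_id = :t COLLATE NOCASE
        ORDER BY s.pdb_id
    """
    # A bound parameter keeps user input out of the SQL text.
    return conn.execute(query, {"t": term}).fetchall()
```

Calling `simple_search("inhA")` returns one row per structure of that gene, while `simple_search("2ied")` returns only the matching structure. An advanced search would add further optional conditions to the same query.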
2.3. Finding structural similarity of bound and cognate ligand
Of the 857 available protein structures of M. tuberculosis, 663 are ligand-bound structures belonging to 251 proteins. All ligands bound to each protein were compared, using the SMSD19 software, with all compounds known to be either substrates or products of the respective protein, based on the catalytic information retrieved from the SwissProt and KEGG20 databases (MOL files for substrates and products were retrieved from the KEGG database for structural comparison). The similarity between the bound ligand and the natural substrates and products of the protein is tabulated and reported as a Tanimoto score.
3.3. MtbSD statistics
MtbSD hosts information on 857 structures for 328 proteins encoded by the M. tuberculosis genome. Of the 857 available structures, 824 have been determined by X-ray crystallography and 30 by NMR, and 3 are theoretical models. Statistical details for the functional categories of M. tuberculosis proteins are shown in Table 1. SCOP classification is available for structures of 149 gene products. Examining the distribution of M. tuberculosis protein structures in the SCOP database revealed that the Alpha/Beta and Alpha + Beta classes of proteins were the most widely represented.
Figure 1. Mycobacterium tuberculosis Structural Database (a) Home page of MtbSD. (b & c) Tables of apo and holoprotein details that can be accessed from the menu on the left side of the page.
3.4. Gene index
All the 328 gene products of M. tuberculosis with solved structures are listed in the Gene index page (Figure 3). Clicking the gene name or Rv number opens a drop-down menu that lists the MtbSD accession number and the PDB ids for the corresponding gene. The MtbSD accession number and the PDB ids are hyperlinked to the protein information and structure information pages of the database, respectively.
3.5. Structural coverage of functional group based on COG
Based on the functional classification tree (COG), the 328 gene products having structural information were mapped into 22 functional groups (Figure 4). Most of the solved structures belonged to proteins involved in amino acid transport and metabolism, coenzyme transport and metabolism, general function prediction only, lipid transport and metabolism, and signal transduction mechanisms, or to proteins not assigned to any COG category. With this kind of information in
Figure 2. (A) The advanced search interface of MtbSD. (B) The page shown when the user searches by PDB id, Rv number, MtbSD id etc.
Table 1
Functional categorization of M. tuberculosis proteins.

Category                Proteins with solved structure
Apoprotein              125
Holoprotein             251
Dormancy                70
Essential proteins      74
Virulence proteins      22
Hypothetical proteins   31
Protein-DNA             9
Protein-drug            55
Protein-protein         20
Protein-peptide         2
MtbSD, researchers are provided with an opportunity to explore the structural details of unsolved proteins that are involved in vital biological processes, and thus to throw more light on the physiology of M. tuberculosis.
3.6. Protein information page
The protein information page provides information specific to each protein selected by the user (Figure 5A). The page has four tables (Summary table, Functional details, PDB ID and bound ligand details, and Other resources). The summary table provides the gene name, protein name, protein length and enzyme classification (EC) number. The functional details table provides information on the domains in the selected protein; the PFAM id and domain name are hyperlinked to the PFAM database. The table also provides information on protein function, reactions catalyzed, cofactors used and the metabolic pathways in which the protein is involved.
The third table provides information on the different solved structures available for the selected protein and the ligands bound to each structure. Users can also compare the bound ligands with the natural substrates and products (of the catalytic reaction), and select the appropriate protein complex for modelling, docking or drug designing studies. This page also provides an alignment of the sequences of the available structures with the full-length sequence of the protein taken from SwissProt (Figure 5B). The alignment is colour coded for easy detection of differences or mutations in the sequences, and the position of each domain is highlighted and labelled. This information tells us (a) whether the structure covers the whole protein or only part of it, (b) which domains are covered by each individual structure, and (c) which mutations are present in each structure.
For successful drug designing, docking or modelling, selecting the appropriate experimental structure from those available is a vital step. For example, the product of the inhA gene is an important target of the drug isoniazid. InhA actively takes part in lipid metabolism and fatty acid biosynthesis, and uses the NAD molecule as its natural substrate. Currently, there are 33 experimentally solved structures for the InhA protein. Many of these structures are in complex with NAD and other molecules, whereas a few, such as 2AQH, 2AQK, 2AQI, 1ENY and 1ENZ, are bound to NAD alone. Among the 33 structures, 2IED is the only apo structure available for InhA. Understanding the conformational changes that occur between the apo and holo structures upon binding of NAD will be an important
Figure 3. The gene index page lists all the proteins of M. tuberculosis that have PDB ids, with structures solved by either X-ray or NMR methods. When the user hovers over the gene name or Rv number, a drop-down menu shows the MtbSD id and the list of hyperlinked PDB ids for the respective protein.
Figure 4. Functional classification and structural coverage of M. tuberculosis proteins. For each functional group, the number of proteins having structural information is listed.
step in selecting the appropriate template. Apart from the bound ligand, it is also important to know the mutations present in these structures and how much of the entire protein each structure covers. Selecting the protein structure on the basis of these factors for drug designing, docking or modelling is expected to improve the enrichment factor. Both the PDB id and the ligand data are hyperlinked to the structural information and ligand information pages, respectively. Since all this information is available in one place, the user can easily compare the various protein structures and the catalytic activity of the proteins when bound to various ligands.
3.7. Structural information page
This page provides information on the experimental method employed for structure determination, resolution, polymer information, molecular classification, publication, authors, etc. (Figure 6A). For structures of protein-protein complexes, the gene coding for each polymer is also mentioned, so the user need not do
a sequence search against the M. tuberculosis genome to identify the gene. The menu on the left side of the page provides links to secondary structure details, ligand information (Figure 6B), active site (Figure 6C) and structural homologues. The secondary structure information page provides details on the residues that form helices and strands. The parallel or antiparallel arrangement of the strands is displayed in the strand order column. The active site information page lists the residues involved in binding the ligands, the type of interaction, ligand-metal interactions and the position of the residues in secondary structure elements such as helix, strand or loop. Users can download the atom file of the active site residues, which can be used directly for docking studies.
3.8. Ligand information page
The ligand information page provides the ligand id, ligand name, chemical formula, molecular weight and SMILES string of the bound and the cognate ligand. Structural similarity of the bound
Figure 5. (A) The protein information page provides detailed information on the protein. (B) Alignment between the structure sequence from PDB and the sequence from SwissProt. This page reveals the structure coverage, position of the domain and mutational information.
Figure 6. (A) The structural information page provides information on the protein structure. (B) This page provides the bound ligand details and their similarity to the cognate ligand. (C) Active site information is provided in this page and is available for download.
ligand to the cognate ligand, expressed as a Tanimoto score, is also provided (Figure 6B). This information will facilitate the selection of the appropriate protein complex for further studies.
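The Tanimoto score mentioned above has the general form T = c / (a + b - c), where a and b count the features of the two molecules and c counts the features they share. The Python sketch below illustrates only this general set-based form. It is not the SMSD implementation, which derives the shared part from a common-substructure search, and the `tanimoto` function and the fragment labels used as features are hypothetical.

```python
def tanimoto(features_a: set[str], features_b: set[str]) -> float:
    """Set-based Tanimoto coefficient: shared / (|A| + |B| - shared)."""
    if not features_a and not features_b:
        return 0.0  # convention chosen for this sketch when both sets are empty
    shared = len(features_a & features_b)
    return shared / (len(features_a) + len(features_b) - shared)


# Hypothetical fragment labels standing in for real structural features.
bound_ligand = {"adenine", "ribose", "phosphate", "nicotinamide"}
cognate_ligand = {"adenine", "ribose", "phosphate"}

score = tanimoto(bound_ligand, cognate_ligand)  # 3 / (4 + 3 - 3) = 0.75
```

A score of 1.0 means the two feature sets are identical and 0.0 means they share nothing, so a complex whose bound ligand scores close to 1.0 against the cognate ligand is the more natural choice of template.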
3.9. Structurally similar folds within the M. tuberculosis genome
For each protein structure, other proteins with a similar fold within the genome of M. tuberculosis are listed in the Structural homologue page. Each of these hits was manually analyzed to pick up proteins with a similar fold even at very low sequence identity. For proteins with SCOP classification, each domain of the protein was taken for superimposition. The class and fold details for the query as well as for the hit, along with the RMSD (root mean square deviation) and sequence identity, are given in this page. For proteins that do not have SCOP classification, the entire structure was taken for superimposition, and domain-related information for the query and the hit is provided along with the RMSD and sequence identity. The structurally related proteins were further grouped into paralogues and analogues: proteins with similar domains were grouped as paralogues, and proteins having similar structure but different functions, based on EC classification, were grouped as analogues. Structure-based sequence alignment with secondary structure representation for M. tuberculosis proteins is also provided in this page. Users can download the superimposed structural file and the structure-based sequence alignment. This information can be used to identify proteins structurally similar to known drug targets and to explore whether existing or newly identified inhibitors can bind to other proteins of M. tuberculosis besides the expected drug target. This page can be accessed from the structural information page for each PDB id, and also from the advanced search page using either the PDB id or the Rv number.
4. Conclusions
MtbSD has been developed to catalogue and categorize the proteins of M. tuberculosis having structural information; this will help the TB research community to understand the differences between the structures available for each gene product and to select the ideal template for docking, virtual screening and structure-based drug designing. The novelty and usefulness of MtbSD lie not only in its exhaustive compilation of the available data on the topic, but also in the presentation of these data in a comparative form, in one place and in a user-friendly manner. The database will be updated as and when new structures are solved for M. tuberculosis proteins. Tools such as BLAST and ClustalW will be incorporated in the future.
Acknowledgements
The authors wish to acknowledge ICMR-Biomedical Informatics and the National Institute for Research in Tuberculosis (formerly Tuberculosis Research Centre) for the funding provided. We thank Ms. Reema Singh, Biomedical Informatics Centre, Indian Council of Medical Research, New Delhi, for her help in hosting the database. We also thank Mr. Senthilnathan, NIRT, and Mr. Deepak. P for their help in editing figures and providing technical support.
Ethical approval: Not required.
Funding: None.
Competing interests: None declared.
References
1. World Health Organisation. Global tuberculosis control: a short update to the 2009 report; 2009.
2. Directorate of Health Services, Ministry of Health and Family Welfare, www.tbcindia.org; 2009.
3. Cole ST, Brosch R, Parkhill J, Garnier T, Churcher C, Harris D, Gordon SV, Eiglmeier K, Gas S, Barry 3rd CE, Tekaia F, Badcock K, Basham D, Brown D, Chillingworth T, Connor R, Davies R, Devlin K, Feltwell T, Gentles S, Hamlin N, Holroyd S, Hornsby T, Jagels K, Krogh A, McLean J, Moule S, Murphy L, Oliver K, Osborne J, Quail MA, Rajandream MA, Rogers J, Rutter S, Seeger K, Skelton J, Squares R, Squares S, Sulston JE, Taylor K, Whitehead S, Barrell BG. Deciphering the biology of Mycobacterium tuberculosis from the complete genome sequence. Nature 1998;393:537–44.
4. Zarembinski TI, Hung LW, Mueller-Dieckmann HJ, Kim KK, Yokota H, Kim R, Kim SH. Structure-based assignment of the biochemical function of a hypothetical protein: a test case of structural genomics. Proc Natl Acad Sci U S A 1998;95:15189–93.
5. Kim SH. Shining a light on structural genomics. Nat Struct Biol 1998;5(Suppl.):643–5.
6. Koike R, Amemiya T, Ota M, Kidera A. Protein structural change upon ligand binding correlates with enzymatic reaction mechanism. J Mol Biol 2008;379:397–401.
7. McGovern SL, Shoichet BK. Information decay in molecular docking screens against holo, apo, and modeled conformations of enzymes. J Med Chem 2003;46:2895–907.
8. Warren GL, Andrews CW, Capelli AM, Clarke B, LaLonde J, Lambert MH, Lindvall M, Nevins N, Semus SF, Senger S, Tedesco G, Wall ID, Woolven JM, Peishoff CE, Head MS. A critical assessment of docking programs and scoring functions. J Med Chem 2006;49:5912–31.
9. Murray CW, Baxter CA, Frenkel AD. The sensitivity of the results of molecular docking to induced fit effects: application to thrombin, thermolysin and neuraminidase. J Comput Aided Mol Des 1999;13:547–62.
10. Boeckmann B, Bairoch A, Apweiler R, Blatter MC, Estreicher A, Gasteiger E, Martin MJ, Michoud K, O’Donovan C, Phan I, Pilbout S, Schneider M. The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res 2003;31:365–70.
11. Bashton M, Nobeli I, Thornton JM. PROCOGNATE: a cognate ligand domain mapping for enzymes. Nucleic Acids Res 2008;36:D618–22.
12. Musa TL, Ioerger TR, Sacchettini JC. The tuberculosis structural genomics consortium: a structural genomics approach to drug discovery. Adv Protein Chem Struct Biol 2009;77:41–76.
13. Kapopoulou A, Lew JM, Cole ST. The MycoBrowser portal: a comprehensive and manually annotated resource for mycobacterial genomes. Tuberculosis (Edinb) 2011;91:8–13.
14. Galagan JE, Sisk P, Stolte C, Weiner B, Koehrsen M, Wymore F, Reddy TB, Zucker JD, Engels R, Gellesch M, Hubble J, Jin H, Larson L, Mao M, Nitzberg M, White J, Zachariah ZK, Sherlock G, Ball CA, Schoolnik GK. TB database 2010: overview and update. Tuberculosis (Edinb) 2010;90:225–35.
15. Sussman JL, Lin D, Jiang J, Manning NO, Prilusky J, Ritter O, Abola EE. Protein Data Bank (PDB): database of three-dimensional structural information of biological macromolecules. Acta Crystallogr D Biol Crystallogr 1998;54:1078–84.
16. Johnson M, Zaretskaya I, Raytselis Y, Merezhuk Y, McGinnis S, Madden TL. NCBI BLAST: a better web interface. Nucleic Acids Res 2008;36:W5–9.
17. Aguero F, Al-Lazikani B, Aslett M, Berriman M, Buckner FS, Campbell RK, Carmona S, Carruthers IM, Chan AW, Chen F, Crowther GJ, Doyle MA, Hertz-Fowler C, Hopkins AL, McAllister G, Nwaka S, Overington JP, Pain A, Paolini GV, Pieper U, Ralph SA, Riechers A, Roos DS, Sali A, Shanmugam D, Suzuki T, Van Voorhis WC, Verlinde CL. Genomic-scale prioritization of drug targets: the TDR targets database. Nat Rev Drug Discov 2008;7:900–7.
18. Tatusov RL, Fedorova ND, Jackson JD, Jacobs AR, Kiryutin B, Koonin EV, Krylov DM, Mazumder R, Mekhedov SL, Nikolskaya AN, Rao BS, Smirnov S, Sverdlov AV, Vasudevan S, Wolf YI, Yin JJ, Natale DA. The COG database: an updated version includes eukaryotes. BMC Bioinform 2003;4:41.
19. Rahman SA, Bashton M, Holliday GL, Schrader R, Thornton JM. Small Molecule Subgraph Detector (SMSD) toolkit. J Cheminform 2009;1:12.
20. Kanehisa M, Goto S. KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res 2000;28:27–30.
21. Finn RD, Mistry J, Tate J, Coggill P, Heger A, Pollington JE, Gavin OL, Gunasekaran P, Ceric G, Forslund K, Holm L, Sonnhammer EL, Eddy SR, Bateman A. The Pfam protein families database. Nucleic Acids Res 2010;38:D211–22.
22. Lo Conte L, Brenner SE, Hubbard TJ, Chothia C, Murzin AG. SCOP database in 2002: refinements accommodate structural genomics. Nucleic Acids Res 2002;30:264–7.