doi:10.1016/j.jmb.2006.09.041
J. Mol. Biol. (2006) 364, 836–852
Cognate Ligand Domain Mapping for Enzymes
Matthew Bashton 1⁎, Irene Nobeli 1,2 and Janet M. Thornton 1
1 EMBL–European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
2 Randall Division of Cell and Molecular Biophysics, New Hunt's House, King's College London, Guy's Campus, London, SE1 1UL, UK
Here, we present an automatic assignment of potential cognate ligands to domains of enzymes in the CATH and SCOP protein domain classifications on the basis of structural data available in the wwPDB. This procedure involves two steps; firstly, we assign the binding of particular ligands to particular domains; secondly, we compare the chemical similarity of the PDB ligands to ligands in KEGG in order to assign cognate ligands. We find that use of the Enzyme Commission (EC) numbers is necessary to enable efficient and accurate cognate ligand assignment. The PROCOGNATE database currently has cognate ligand mapping for 3277 (4118) protein structures and 351 (302) superfamilies, as described by the CATH and (SCOP) databases, respectively. We find that just under half of all ligands are only and always bound by a single domain, with 16% bound by more than one domain and the remainder of the ligands showing a variety of binding modes. This finding has implications for domain recombination and the evolution of new protein functions. Domain architecture or context is also found to affect substrate specificity of particular domains, and we discuss example cases. The most popular PDB ligands are all found to be generic components of crystallisation buffers, highlighting the non-cognate ligand problem inherent in the PDB. In contrast, the most popular cognate ligands are all found to be universal cellular currencies of reducing power and energy such as NADH, FADH2 and ATP, respectively, reflecting the fact that the vast majority of enzymatic reactions utilise one of these popular cofactors. These ligands all share a common adenine ribonucleotide moiety, suggesting that many different domain superfamilies have converged to bind this chemical framework. © 2006 Elsevier Ltd. All rights reserved.
*Corresponding author
Keywords: cognate–ligand; domains; CATH; SCOP; KEGG
Introduction One of the key factors underlying the majority of processes in the cell is recognition of small molecules by proteins. The vast majority of enzymes and many components of cellular signalling pathways require the binding of a small molecule or ligand to a protein, often in a specific or selective manner. Such ligands, which are thought to be present in vivo and are usually "required" for biological function, are denoted here as cognate ligands, whereas ligands that bind (for example, in a test-tube) but are either not normally present where the protein functions in vivo or are not required for function are denoted non-cognate. Non-biological enzyme inhibitors or substrate analogues will often be present in crystal
structures, and it is these molecules that we term non-cognate. For over 30 years it has been known that particular folds can be associated with the binding of specific types of molecules, such as the Rossmann domain, which binds nicotinamide adenine dinucleotide (NAD).1,2 Yet, for the majority of domains, the extent to which identification of a specific type of domain can indicate what ligand(s) may be bound remains to be determined. Towards this end, we present a systematic approach to study the observed interactions between domains in enzymes, as described by the SCOP3 and CATH4 databases, and the cognate small molecules they bind. We achieve this using data from the worldwide-Protein Data Bank (wwPDB)5 as provided in the Macromolecular Structure Database (MSD),6 the ENZYME7 enzyme nomenclature database, and the KEGG pathway database.8 This study allows us to investigate to what extent particular superfamilies bind particular ligands or particular classes of ligands. We may find that key
features of the ligand-binding sites in evolutionarily related domains have been conserved and recognize similar chemical moieties in a range of compounds, and/or we may find that similar or identical compounds can be bound by a range of domains, which are not related evolutionarily. The former represents evolutionary divergence of an ancestral domain to give rise to descendants that potentially bind a wider range of ligands, and the latter represents convergent evolution of domains to bind the finite set of small molecules utilized in biology. In this analysis, it is important to distinguish between binding and function: two related proteins may well bind the same molecule yet perform different biological functions, as illustrated by the Rossmann domains, which mostly bind nucleotides but are involved in many different classes of catalysis. Currently, many of the ligands present in structures in the wwPDB5 are not the cognate ligands. For most enzymes, PDB ligands are usually either an inhibitor of activity, or some other substrate or transition-state analogue. This may be due to the difficulty of crystallizing a protein using its cognate ligand(s) or, frequently, a small molecule from the crystallization buffer may have bound in place of the cognate ligand. Alternatively, experiments are often designed to explore the binding mode/interactions of an inhibitor for biomedical or pharmaceutical purposes, or to elucidate the catalytic mechanism of the enzyme by choosing a ligand that is close to a proposed transition state. Thus, the goal here is to provide a validated mapping or pairing between protein domains and their cognate ligands, out of all the combinations of small molecules present in the metabolome and proteins in the proteome.
In order to circumvent the non-cognate ligand problem, we have studied only enzymes in the PDB for which an enzyme commission (EC) number has been assigned so that, in principle, the cognate substrates, products and cofactors for the enzyme are known. There have been many previous studies associating ligand binding with particular domains in protein structures, with initial studies focusing on a few ligands or a few domains. Rossmann et al.,1 and Ohlsson et al.,2 both observed similarity in the NAD-binding domain in dehydrogenases; Schulz and Schirmer also observed similarity between the nucleotide-binding domains of adenyl kinase, subtilisin, flavodoxin and dehydrogenases.9 The similarity of the NAD and flavin adenine dinucleotide (FAD)-binding domains was observed by Schulz in glutathione reductase,10 and by McKie et al. in various FAD and NAD-dependent reductases and dehydrogenases.11 Several studies have observed commonality in the features of ATP-binding proteins: Walker et al. discussed ATP-binding in ATP synthase, myosin, and kinases, amongst other proteins;12 Kobayashi et al.,13 and later Denessiouk et al.,14 discussed ATP binding in a wider selection of proteins. Other studies have examined nucleotide cofactor binding in a range of domains and
ligands.15–18 A more recent and comprehensive survey covering the whole PDB is the PDBLIG database,19 where the binding of all ligands within the PDB is associated with domains of the CATH structural classification. Although we also associate the binding of PDB ligands to domains, our work differs from the PDBLIG database, in that we have mapped the cognate (i.e. natural) ligands used in various enzymatic reactions to particular domains rather than the various inhibitors and analogues that are present in most PDB structures. Some of the features incorporated in our work are available as a subset of other databases, all of which allow the user to search for a particular ligand bound by a protein structure. The MSDsite contains information on the interactions between ligands and protein structures, although this database does not assign the binding of ligands to particular domains and, as the ligands are those found in crystal structures, they are frequently non-cognate.20 PDBsum from our own research group also incorporates cognate ligand matching, although it does not assign this information to particular domains.21 Relibase22 and Ligand Depot23 are both databases of protein–ligand interactions, although neither offers cognate ligands or associations of ligands with particular protein domains. The BIND database includes information on a wide range of protein–ligand interactions, including those observed in protein structures.24 However, the ligands bound in these structures are often non-cognate, as with MSDsite, since BIND reports the ligands found in the PDB entry; additionally, the binding of ligands is not assigned to particular domains.
The recent SiteBase database looks into protein–ligand interaction from an angle different from that of our work, allowing the comparison of binding site similarity in structures, independent of the global sequence and structural similarity.25 This service also shows the results in relation to the SCOP domain classification, although it works only with PDB ligands. Another important distinction is that we have explicitly included the protein quaternary structure in our assignments, using the (PQS) assemblies present in the MSD.26 Many ligands interact with multiple chains and therefore it is necessary to include these complexes for correct assignments. Our data are freely available (see Methods). In order to produce the cognate ligand mapping, we firstly assigned the binding of the PDB ligands to specific domains in protein structures. Binding sites may be located on different chains or even discontinuous segments of sequence; additionally, some ligands may be bound by more than one domain, either proportionally in a shared manner, or disproportionately with the vast majority of contacts coming from one domain only. Secondly, all ligands in a PDB entry for a structure are compared using two-dimensional graph-matching to all compounds known to be substrates, products, or cofactors for that enzyme, using data from the ENZYME and KEGG databases. The ligands from the various
known reactions for an enzyme (as found in KEGG) are then matched up with the PDB ligands present in the PDB structure. It is the ligands from the list of known reactions for an enzyme in KEGG that we term cognate. PDB ligands that are identical with the cognate KEGG ligands are described as cognate; those that are similar (above a specified threshold) are termed cognate-like. It should be noted that our approach to cognate ligand assignment differs considerably from the practice of trying to computationally infer protein–ligand interactions through docking. In docking, the ligand is either a known inhibitor, substrate analogue, or a member of a ligand library that is screened against the clefts or a potential binding site in a structure. The aim is to find the best match between the binding site and a small molecule by virtual docking of the molecule into the atomic coordinates of the protein structure. This is a computationally intensive approach that relies on conformational space searching and energy minimisation of the docked ligand. For a comprehensive review of docking, see Halperin et al.27 Our goal in this work is not to predict the ligand for a given structure, or to work out where a particular ligand should dock, but to check that ligands within a structure are cognate and, if not, identify a cognate ligand that could occupy the same site as that of the PDB ligand. The cognate ligands are assigned on the basis of chemical similarity between the known compounds involved in an enzyme's reaction and the ligands found in the PDB structure. Our work is, however, complementary to docking, in that the likely cognate ligands suggested by our work could be docked into existing protein structures using the original PDB ligand location as a starting point.
We hope that our database will be useful for exploring the evolution of protein domains and ligand specificity, and for identifying with which ligand(s) a functionally uncharacterized protein with a given domain composition may be likely to bind. Our database may be used also for creating datasets to be used to benchmark programs that attempt to predict the cognate ligand of a particular protein.
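The matching step described above can be illustrated with a minimal Python sketch. It assumes a caller-supplied similarity function returning a score between 0 and 1; the function `toy_similarity`, its made-up scores, the names `assign_cognate` and `Assignment`, and the single cut-off of 0.5 are all hypothetical stand-ins for the two-dimensional graph matching and the cut-offs given in Methods.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional

# Assumption: mirrors the "greater than 0.5" score quoted in the figure captions.
SIMILARITY_CUTOFF = 0.5


@dataclass(frozen=True)
class Assignment:
    pdb_ligand: str
    cognate: Optional[str]  # best-matching ligand from the enzyme's reactions, if any
    score: float
    label: str  # "cognate", "cognate-like" or "unassigned"


def assign_cognate(
    pdb_ligand: str,
    ec_ligands: Iterable[str],
    similarity: Callable[[str, str], float],
    cutoff: float = SIMILARITY_CUTOFF,
) -> Assignment:
    """Pick the most similar candidate among the ligands listed for the enzyme."""
    best_name, best_score = None, 0.0
    for candidate in ec_ligands:
        score = similarity(pdb_ligand, candidate)
        if score > best_score:
            best_name, best_score = candidate, score
    if best_name is None or best_score <= cutoff:
        return Assignment(pdb_ligand, None, best_score, "unassigned")
    # An exact match (score of 1) is cognate; anything else above the cut-off is cognate-like.
    label = "cognate" if best_score == 1.0 else "cognate-like"
    return Assignment(pdb_ligand, best_name, best_score, label)


# Made-up scores standing in for a real graph-matching routine.
TOY_SCORES = {("oxamate", "pyruvate"): 0.8, ("oxamate", "lactate"): 0.7}


def toy_similarity(a: str, b: str) -> float:
    if a == b:
        return 1.0
    return TOY_SCORES.get((a, b), TOY_SCORES.get((b, a), 0.0))


# Ligands of the lactate dehydrogenase reaction (lactate + NAD = pyruvate + NADH).
LDH_LIGANDS = ["lactate", "NAD", "pyruvate", "NADH"]
oxamate = assign_cognate("oxamate", LDH_LIGANDS, toy_similarity)
nad = assign_cognate("NAD", LDH_LIGANDS, toy_similarity)
```

Because the candidate list is restricted to the ligands of the enzyme's own reactions, a PDB ligand can only be labelled cognate if it is one of those ligands, and an analogue is paired with its nearest reaction ligand rather than with an unrelated compound.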
Results and Discussion Enzyme Commission numbers are necessary for efficient and accurate cognate ligand mapping During development of the database we tried two different approaches to identify if ligands were cognate. The first used rapid hashed fingerprint matching to compare all compounds present as ligands in the PDB to all compounds in KEGG.28 We hoped that the various inhibitors and substrate analogues, present as ligands in PDB structures, would match the correct cognate ligands in the
KEGG list of compounds that are found in natural enzyme reactions. The comparison of fingerprints for all ligands and all PDB enzymes took only 50 minutes. However, simply finding the closest match for a PDB ligand to a compound present in KEGG does not always identify the correct cognate ligand. Taking lactate dehydrogenase as an example29 (1ldm), we find that the normal substrate lactate has been replaced by the substrate mimic oxamate, which is an inhibitor of the enzyme's activity. However, since oxamate exists also within KEGG as a cognate ligand for the reaction of the carbamoyl phosphate:oxamate carbamoyltransferase enzyme, oxamate will match itself in the list of KEGG compounds and be marked incorrectly as a cognate ligand. Moreover, many enzymes in the PDB possess modified versions of natural ligands that either have additional chemical groups present, or have natural groups replaced with alternative groups. For example, searching with a modified version of the common cofactor coenzyme A (CoA) against KEGG will often retrieve other modified versions of CoA rather than the cognate CoA. Thus, it is necessary to use the enzyme commission (EC) number to reduce the list of likely cognate ligands for any enzyme and allow correct assignment. Adding the EC number as a filter drastically reduces the number of comparisons to be made, as only the ligands listed for that EC number(s) need be compared to those bound in the structure. Consequently, we can use the computationally more expensive and potentially more accurate graph-matching algorithm. The disadvantage of this approach is that not all enzymes have EC numbers assigned to them, and not all proteins have enzymatic activity; therefore, only partial coverage of the PDB is possible.
Protein–ligand mapping Coverage Our database has domain-PDB ligand contacts for 12,171 protein structures described by the CATH database (v. 2.5.1) and 15,480 structures described by the SCOP (v. 1.67) database, covering 40% and 50% of all PDB entries, respectively (out of a total of 30,695 in release 12 of the MSD Search Database (MSDSD); see Figure 1). From this list, we can provide domain-cognate ligand mappings for 3277 structures in CATH, including 351 superfamilies, and 4118 in SCOP, including 302 superfamilies. This considerably extends previous work from our group in assigning cognate ligands to proteins of Escherichia coli,30 where a total of 87 single-domain CATH superfamilies had cognate ligands assigned. An important advance in this study is that we have assigned the binding of particular ligands to particular domains in multi-domain proteins by considering per-residue contact data to ligands in conjunction with the SCOP and CATH domain boundaries for each structure. Figure 1 shows an overview of these statistics.
Figure 1. Overview of the procedure used to generate the database and numbers at various stages. Initially, there is a reduction in numbers of PDB entries covered, as structures must possess a ligand. Another reduction occurs when cross-referencing these structures to CATH or SCOP. Subsequent reductions are incurred as an EC number is necessary for the cognate ligand mapping. The number of structures, superfamilies and EC numbers shown in the cognate ligand mapping for SCOP and CATH are subject to two cut-off scores on the assignment of the cognate ligand (see Methods). Of the 18,789 cognate ligands (above cut-offs), 12,193 (65%) could be found in the PDB; breaking these down, we find that 58% are cofactors. The total number of structures in the PDB refers to those in the MSDSD release 12 (6-12-2004).
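As a quick consistency check, the percentages quoted in this section can be reproduced from the reported counts. The short Python sketch below does only that arithmetic; the helper name `percent` and the dictionary keys are illustrative choices, and all counts are taken from the text.

```python
def percent(part: int, whole: int) -> int:
    """Return part/whole as a percentage rounded to the nearest integer."""
    return round(100 * part / whole)


TOTAL_PDB_ENTRIES = 30_695   # MSDSD release 12
COGNATE_ASSIGNMENTS = 18_789  # PDB ligands with a cognate ligand above the cut-offs (SCOP)

reported = {
    "CATH coverage of the PDB": percent(12_171, TOTAL_PDB_ENTRIES),
    "SCOP coverage of the PDB": percent(15_480, TOTAL_PDB_ENTRIES),
    "exact cognate matches": percent(12_193, COGNATE_ASSIGNMENTS),
    "cofactors (short-list method)": percent(8_150, COGNATE_ASSIGNMENTS),
    "cofactors (KEGG role method)": percent(5_551, COGNATE_ASSIGNMENTS),
}

for label, value in reported.items():
    print(f"{label}: {value}%")
```

The two cofactor estimates bracket the 30–43% range discussed in the text, so the remaining 57–70% follows by subtraction.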
Number of PDB ligands that are identical with the cognate ligands Our dataset has 18,789 cognate ligand assignments for domains in the SCOP database. By retrieving all the potential cognate ligands with a maximal graph matching score of 1, we find that 12,193 (65%) of the PDB ligands are exact matches to the relevant cognate ligands. Fifty-eight percent of these 12,193 PDB ligands are found to be the cofactors Zn, Cu, Mn, Mg, Fe, Ca, NAD/NADH, NADP/NADPH, FAD/FADH2, ATP/ADP/AMP, GTP/GDP/GMP, and CoA. A breakdown of these exact matches is shown in Figure 1.
Proportion of dataset accounted for by cofactors Given that structures deposited in the PDB are often co-crystallised with a cofactor, an interesting question to ask is what proportion of the ligands in our mapping are cofactors? In our dataset we have 18,789 PDB ligands (bound by SCOP domains) for which we have assigned a potential cognate ligand with scores above the two designated cut-offs (see Methods). One way to assess the proportion of cofactors is to simply make a list of KEGG ligands that can be considered cofactors. Using Zn, Cu, Mn, Mg, Fe, Ca, NAD/NADH, NADP/NADPH, FAD/FADH2, ATP/ADP/AMP, GTP/GDP/GMP, CoA and PLP as a short list, we find that 8150 (43%) of the ligands are cofactors. Of course, this is an approximation, since this does not include all possible cofactors, so that a few "rare" cofactors may be missed, and these ligands may not always act as cofactors in all reactions. Also, this result is likely to be an underestimate, as cofactor derivatives such as UDP, dGTP, dATP, dCTP, dTTP, IDP, IMP, uracil etc. may be present in some reactions and could have been assigned as potential cognate ligands, yet are not included in the above list. An alternative approach is to count for each structure how many ligands have been assigned a cofactor role in a KEGG reaction. Using this method, we find that 5551 (30%) of the ligands are cofactors.
The accuracy of this method relies on the assignment of the cofactor role in KEGG and, since this is based on a manually annotated reaction, it is likely to be more accurate than the above method. Thus, in our mapping, the proportion of ligands that are cofactors is around 30–43%, indicating that 57–70% of the data will be either a substrate or a product in an enzyme reaction. It should be noted that there are no substrate analogues present in the mapping, as the list of potential cognates has been retrieved from known reactions for each enzyme (ENZYME and KEGG), which contain only substrates, products and cofactors. Comparison of the molecular mass of the cognate and PDB ligand Figure 2 shows the relationship between the molecular mass of a cognate ligand and that of the corresponding PDB ligand. Most of the potential cognate ligands have a molecular mass very similar to that of their PDB ligand: 23% are identical, 33% of the cognate ligands are heavier and 44% are lighter than their PDB ligand counterparts. Given the "closed pocket" character of many enzyme active sites, it is slightly easier to bind a smaller ligand, rather than a larger one. It should be noted that this comparison is not valid for polymeric substrates such as proteins, nucleic acids and oligosaccharides, as the KEGG ligands are represented as one monomer and the PDB ligand is often just one or a few monomers. Figure 2 shows that the PDB ligands vary between ∼0.5 and 2.0 times the size of the cognate ligand. Promiscuity in partners Which ligands have the most different partner domains, and which domains have the most partner ligands? We chose to measure this by the number of different superfamilies to which a ligand binds rather than individual PDB entries, avoiding overrepresenting ligands that are bound exclusively by a single superfamily with many structures; e.g. NAD,
Figure 2. Comparison of the molecular mass (in Daltons) of the ligands in the PDB structures (Y) and their respective cognate (X) ligand shows a correlation of R² = 0.88. The y = 0.5x and y = 2x lines show that almost all the ligands in the PDB structures in this sample have a mass between half and double the molecular mass of the cognate molecule.
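The mass comparison summarised in Figure 2 amounts to a ratio test per ligand pair. The following Python sketch shows one way to express it; the function name `compare_mass`, the tolerance used to call two masses identical, and the example masses are hypothetical and are not taken from the dataset.

```python
from typing import Tuple


def compare_mass(pdb_mass: float, cognate_mass: float, tol: float = 1e-6) -> Tuple[float, str, bool]:
    """Return (PDB/cognate mass ratio, relation of the cognate ligand, within 0.5x-2x band)."""
    if pdb_mass <= 0 or cognate_mass <= 0:
        raise ValueError("molecular masses must be positive")
    ratio = pdb_mass / cognate_mass
    if abs(pdb_mass - cognate_mass) <= tol:
        relation = "identical"
    elif cognate_mass > pdb_mass:
        relation = "cognate_heavier"
    else:
        relation = "cognate_lighter"
    # The band corresponds to the y = 0.5x and y = 2x lines in Figure 2.
    within_band = 0.5 <= ratio <= 2.0
    return ratio, relation, within_band


# Hypothetical masses in Daltons, for illustration only.
same = compare_mass(100.0, 100.0)
smaller_pdb = compare_mass(80.0, 100.0)
outlier = compare_mass(300.0, 100.0)
```

Tallying the `relation` field over all ligand pairs would give the identical, heavier and lighter proportions, and the `within_band` flag marks the pairs lying between the two lines of the plot.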
which is bound by the Rossmann domain in many related dehydrogenases. Promiscuous ligands. The left-hand side of Figure 3 shows the 40 PDB ligands in our dataset that are bound by the largest number of superfamilies. The magnitude of the non-cognate ligand problem within the PDB is clear, with the most frequent ligand being the sulphate ion, followed by glycerol. Both these compounds are present frequently in the crystallisation buffer and are expected to bind nonspecifically in crystal structures. Additionally, other compounds in the list, such as 2-methyl-2,4-pentanediol (MPD), 2-amino-2-hydroxymethyl-propane-1,3-diol (Tris), 2-(N-morpholino)-ethanesulfonic acid (Mes) and 4-(2-hydroxyethyl)-1-piperazine ethanesulfonic acid (Hepes), are commonly found in crystallisation buffers. Such compounds, although
bound in the crystal structure, are unlikely to have a biological role in catalysis or for signalling or regulatory purposes (although these molecules often bind in active sites, as well as other small pockets on the surface of the protein). In contrast, the list of the 40 most frequent cognate ligands in Figure 3 is entirely different, and is as expected when considering biological systems. The most common cognate ligands, the cofactors ATP and NAD or their component moieties, e.g. ADP and AMP, occupy the top nine positions, with the exception of zinc in sixth place, presumably there by virtue of its role in many enzyme active sites and as a structural metal. Additionally, coenzyme A (CoA) lies in tenth place, followed by FAD. These enzyme cofactors have ubiquitous roles. ATP is the universal cellular currency of energy; NADH or FADH2 are incorporated to exploit their reducing
Figure 3. The top 40 PDB and cognate ligands ordered by the number of superfamilies to which they bind. All ligands must have at least three contacts in total to the protein, and the cognate ligand must have a graph matching score of greater than 0.5 to the PDB ligand. The ligands listed here are bound in a non-shared manner by the domains; this means that a domain must contribute ≥75% of the total contacts to any one ligand to "own" it and for the ligand's binding to be assigned to it. For more details on the domain-ligand assignment procedure, see Methods. Ligands of particular biological significance are highlighted on both sides of the table for comparison.
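The ranking used for Figure 3 counts distinct superfamilies per ligand rather than structures. A minimal Python sketch of that counting is given below; the record layout, the function name `rank_by_superfamilies` and the toy records (including the placeholder structure identifiers) are assumptions made for illustration and do not come from the database.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple


def rank_by_superfamilies(
    records: Iterable[Tuple[str, str, str]], top: int = 40
) -> List[Tuple[str, int]]:
    """Rank ligands by the number of distinct superfamilies that bind them.

    Each record is (ligand, superfamily, structure_id); repeated structures of
    the same superfamily therefore do not inflate a ligand's count.
    """
    families = defaultdict(set)
    for ligand, superfamily, _structure in records:
        families[ligand].add(superfamily)
    # Sort by descending superfamily count, then by ligand name for stable output.
    ranked = sorted(families.items(), key=lambda item: (-len(item[1]), item[0]))
    return [(ligand, len(members)) for ligand, members in ranked[:top]]


# Toy records with placeholder structure identifiers.
toy_records = [
    ("NAD", "c.2.1", "struct_1"),
    ("NAD", "c.2.1", "struct_2"),
    ("NAD", "c.2.1", "struct_3"),
    ("ATP", "c.37.1", "struct_4"),
    ("ATP", "c.55.1", "struct_5"),
]
ranking = rank_by_superfamilies(toy_records)
```

In the toy data NAD appears in more structures than ATP but in only one superfamily, so it ranks lower, which is the over-representation effect that counting superfamilies is meant to avoid.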
Table 1. Top 40 superfamilies ranked by the number of different cognate ligands they bind
Superfamily ID | Superfamily name | Number of different cognate ligands bound by superfamily
c.37.1 | P-loop containing nucleoside triphosphate hydrolases | 46
c.2.1 | NAD(P)-binding Rossmann-fold domains | 39
c.14.1 | ClpP/crotonase | 37
c.67.1 | PLP-dependent transferases | 34
c.1.8 | (Trans)glycosidases | 32
e.6.1 | Acyl-CoA dehydrogenase NM domain-like | 24
c.56.2 | Purine and uridine phosphorylases | 23
c.1.10 | Aldolase | 22
d.58.6 | Nucleoside diphosphate kinases | 22
c.26.1 | Nucleotidylyl transferase | 21
b.82.2 | Clavaminate synthase-like | 18
c.68.1 | Nucleotide-diphospho-sugar transferases | 18
c.1.9 | Metallo-dependent hydrolases | 17
c.7.1 | PFL-like glycyl radical enzymes | 17
d.17.4 | NTF2-like | 17
c.3.1 | FAD/NAD(P)-binding domain | 16
c.61.1 | PRTase-like | 16
c.66.1 | S-Adenosyl-L-methionine-dependent methyltransferases | 16
c.69.1 | Alpha/beta-hydrolases | 16
d.114.1 | 5′-Nucleotidase (syn. UDP-sugar hydrolase), C-terminal domain | 15
c.1.12 | Phosphoenolpyruvate/pyruvate domain | 14
a.93.1 | Heme-dependent peroxidases | 13
c.1.11 | Enolase C-terminal domain-like | 13
c.55.1 | Actin-like ATPase domain | 13
c.72.1 | Ribokinase-like | 13
b.82.1 | RmlC-like cupins | 12
c.1.4 | FMN-linked oxidoreductases | 12
c.108.1 | HAD-like | 12
c.87.1 | UDP-glycosyltransferase/glycogen phosphorylase | 11
d.81.1 | Glyceraldehyde-3-phosphate dehydrogenase-like, C-terminal domain | 11
e.7.1 | Carbohydrate phosphatase | 11
c.1.7 | NAD(P)-linked oxidoreductase | 10
c.82.1 | ALDH-like | 10
d.113.1 | Nudix | 10
b.43.4 | Riboflavin synthase domain-like | 9
c.1.2 | Ribulose-phosphate binding barrel | 9
c.23.16 | Class I glutamine amidotransferase-like | 9
e.22.1 | Dehydroquinate synthase-like | 9
b.3.6 | Aromatic compound dioxygenase | 8
c.60.1 | Phosphoglycerate mutase-like | 8
All ligands must have at least three contacts to the protein, and the cognate ligand must have a graph matching score of greater than 0.5 to the PDB ligand. The cognate ligands listed here are bound in a non-shared manner by the domains; this means that a domain must contribute ≥75% of the total contacts to any one ligand to "own" it and for the ligand's binding to be assigned to it. For more details on the domain–ligand and cognate–ligand assignment procedures, see Methods.
power, and CoA carries the useful acyl group that is used in many reactions. Since frequency was measured in terms of the numbers of domain superfamilies that bind the cognate ligands, it is clear that many domains have evolved to bind this pool of common small molecules that are utilised in many reactions. Even more striking is the fact that these most common cognate ligands all share an adenine ribonucleotide moiety, highlighting the chemical unity of biology. The recognition by proteins of adenylate as a common biological moiety has been discussed at length by Moodie et al.,17 and more recently by Denessiouk et al.14,18 The nature of commonality in nucleotide recognition in proteins was first observed by Rossmann et al.1 and Ohlsson et al.2 in dehydrogenases and later discussed for other ligands and domains.9–13,15,16 Promiscuous domains. Table 1 shows the top 40 superfamilies ranked according to the number of different cognate ligands they bind. The superfamily that binds the largest number of different cognate ligands is the P-loop hydrolases superfamily. The P-loop hydrolases are well known as a highly promiscuous superfamily that occurs with many different partner domains in many sequences31 and structures.32 They are so called because they contain
a glycine-rich phosphate-binding loop that is responsible for binding the phosphate group of nucleotides such as ATP, GTP, and various NTPs. Proteins containing a P-loop hydrolase domain frequently hydrolyse these nucleotides.33 Indeed, most of the 46 cognate ligands bound by the P-loop hydrolases are the various forms of (deoxy)ribonucleoside 5′-(mono/di/tri)phosphates. The second most frequent binder of different cognate ligands, the Rossmann domain, is well known for its role in binding NAD(P) in various dehydrogenases.34 In our study, this diversity of bound ligands is due to the single-domain, short-chain dehydrogenase family,35 in which the domain binds a wide range of dehydrogenase substrates in addition to the NAD(P) cofactor. Although this domain occurs in combination with many other different domains in the two-domain dehydrogenases, its binding activity in these enzymes is more monogamous and it mainly binds only NAD(P). This dichotomy illustrates an important point, that
domain superfamilies are repeatedly used for binding a particular cofactor (or class of molecule as in the P-loops), but they can evolve into a wide range of ligand specificities, as in the short-chain dehydrogenases. Domain and ligand promiscuity. Figure 4(a) shows for each superfamily the number of different cognate ligands bound versus the number of different covalently linked partner domain superfamilies. From this graph there is no obvious correlation between these two routes to diversity; all parts of the graph are occupied. For example, both the Rossmann domain and the P-loop hydrolases have many cognate ligands and many partner domains. However, the graph does not show the evolutionary origin of the diversity, which may arise from making diverse domain partnerships or just by evolving multiple different binding specificities within a single domain, as with the short-chain dehydrogenases. Another superfamily with many
Figure 4. (a) The graph shows the number of cognate ligands bound by a superfamily against the number of partner domains. The number of partner domains a superfamily has is assessed by counting the number of different superfamilies it is found in combination with in various chains of the SCOP database (1.67) rather than our cognate ligand–domain mapping, which is a subset. Superfamilies with 0 partners occur only in isolation as single-domain proteins. Labels adjacent to a point of the graph show selected names of a few superfamilies. (b) The graph shows the number of cognate ligands bound by a superfamily versus its abundance. The abundance of a domain is measured by counting the number of structures in the SCOP database (1.67) in which it occurs; this avoids counting domains twice in instances of multiple identical chains or repeats, thus avoiding over-representation.
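The partner-domain count plotted in Figure 4(a) can be sketched as follows, assuming each chain is available as a list of the superfamily identifiers of its domains. The function name `partner_counts` and the toy chain compositions are hypothetical; the real counts are derived from the chains of SCOP 1.67.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def partner_counts(chains: Iterable[List[str]]) -> Dict[str, int]:
    """Count, for each superfamily, the distinct other superfamilies sharing a chain with it."""
    partners = defaultdict(set)
    for domains in chains:
        unique = set(domains)
        for superfamily in unique:
            # Single-domain chains add an empty set, so the superfamily is
            # still recorded, with zero partners unless seen elsewhere.
            partners[superfamily] |= unique - {superfamily}
    return {superfamily: len(others) for superfamily, others in partners.items()}


# Toy chain compositions, for illustration only.
toy_chains = [
    ["c.2.1", "d.81.1"],
    ["c.2.1"],
    ["c.2.1", "d.81.1"],
    ["c.37.1", "c.2.1"],
    ["b.3.6"],
]
counts = partner_counts(toy_chains)
```

Repeated copies of the same domain within a chain, or repeated chains with the same architecture, do not change the result because partners are held in sets.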
partners and many cognate ligands is the (trans)glycosidases, which have a TIM-α/β-barrel fold.36 Domains of this superfamily are frequently found in proteins that act on sugars and polysaccharides, such as starch or cellulose, and occur in various domain combinations totalling 14 different superfamilies in SCOP. Domain abundance and number of cognate ligands. Figure 4(b) shows a plot of the number of different cognate ligands bound by a superfamily versus the total number of proteins in SCOP (1.67) that have a domain of that particular superfamily. Generally speaking, there is no relationship between abundance and the number of cognate ligands bound by a superfamily. This plot reflects natural abundance, but also any bias within the PDB and the cognate-ligand identification procedure, which is limited to enzymes.
Ligand binding by single or multiple domains Using our automated criteria and manually defined cut-off, ligands can be assigned as being bound by either a single domain (≥75% contacts to one domain) or multiple domains (<75% contacts made by any one domain) (see Methods). It should be noted that ligands can be bound by domains that come from more than one chain, and by domains that are discontinuous in sequence. Figure 5 shows that just under half of all ligands are always bound by single domains, only 16% are always bound by more than one domain, and the remainder are mixed. This finding perhaps reflects the fact that a "one-domain one-ligand paradigm" is simpler and potentially more reusable in terms of allowing subsequent domain recombination events to bring new combinations of ligands together, so creating new functions. To illustrate this, we show an example of a non-shared ligand binding in Figure 6(a) and an example of shared binding in Figure 6(b).
Figure 5. Distribution of ligands according to the domain composition of their binding sites. The non-shared ligands are bound by a single domain; the shared ligands have sites constructed by more than one domain. The mixed ligands have differently classified binding sites in different structures (see Methods).
Domain combination and substrate specificity
Some common or promiscuous domains are found in combination with different partner domains, and we can explore to what extent ligand specificity changes depending upon the partner domains. We discuss three illustrative examples: the P-loops, the Rossmann domain, and the class II aaRS and biotin synthetase domain. The P-loop hydrolase domains are frequently found in proteins that hydrolyse nucleotide triphosphates. The list of domain architectures along with the cognate substrates bound by the P-loop domain is shown in Table 2. Despite different domain combinations and different families of domain within the P-loop superfamily, the vast majority of cognate ligands bound by the P-loop domain are nucleotides or closely related molecules, with the exception of a few ligands that are present in reactions where a (deoxy)ribonucleoside 5′-(mono/di/tri)phosphate or a similar compound is also used, e.g. 3′-phosphoadenylyl sulphate. The second example is the Rossmann domain, whose role is to bind the NAD cofactor to provide reducing power for enzymes such as dehydrogenases (as discussed above). The ligands bound by the Rossmann domain in various domain architectures are shown in Table 3. In the various multi-domain architectures, the Rossmann domain binds primarily a reduced or oxidised version of the enzyme cofactor NAD, although in a few architectures it also binds glycerol (alcohol dehydrogenase substrate), pyruvate (lactate dehydrogenase product) and the cofactor CoA (in place of NAD in succinyl-CoA synthetase). In the single-domain, short-chain dehydrogenases (indicated by SCOP family c.2.1.1 in the Table), the Rossmann domain is now responsible for binding a wide range of different ligands in addition to the NAD cofactor, an activity that is normally located on the
Figure 6. (a) Glyceraldehyde-3-phosphate dehydrogenase. The Rossmann domain is shown in blue, the additional catalytic domain in orange, and nicotinamide adenine dinucleotide (NAD) in green. The NAD cofactor for this enzyme is bound by the Rossmann domain in a non-shared manner: the majority of the contacts to NAD (19 out of 21) are made by the Rossmann domain. The Figure was drawn from PDB entry 1gga.39 (b) An example of the shared binding of CoA between different chains of the homooligomer galactoside acetyltransferase. Chain A is shown in orange, chain B in blue and chain C in purple; the three CoA molecules are shown in green. The Figure was drawn from the PDB entry 1kru.40
Table 2. The domain context and substrate specificity of P-loop hydrolase domains

| Domain architecture | Family | Cognate ligands |
|---|---|---|
| 1:b.47.1, c.37.1, c.37.1, 2:b.47.1 | RNA helicase | Pyrophosphate |
| 1:c.37.1, a.66.1, 2:c.37.1 | Nucleotide and nucleoside kinases | ATP, GTP |
| 1:c.37.1, g.41.2, 2:c.37.1 | Nucleotide and nucleoside kinases | ATP |
| b.122.1, c.26.1, c.37.1 | ATP sulfurylase C-terminal domain | Adenylylselenate, adenylylsulfate |
| b.49.1, c.37.1, a.69.1 | RecA protein-like (ATPase-domain) | ATP, ADP, orthophosphate |
| c.37.1, c.108.1 | Nucleotide and nucleoside kinases | ADP |
| c.37.1, c.60.1 | 6-Phosphofructo-2-kinase/fructose-2,6-bisphosphatase, kinase domain | Orthophosphate |
| Single domain proteins | Nucleotide and nucleoside kinases | ATP, ADP, AMP, CDP, CMP, GMP, UDP, UMP, NDP, dAMP, dCDP, dCMP, dGMP, dTDP, dTMP, dUDP, dUMP, deoxyuridine, thymidine, deoxycytidine |
| Single domain proteins | Nitrogenase iron protein-like | ATP, ADP, GTP, GDP, IMP, N6-(1,2-dicarboxyethyl)AMP, orthophosphate |
| Single domain proteins | RecA protein-like (ATPase-domain) | ATP, ADP, GTP, GDP, pyrophosphate |
| Single domain proteins | Gluconate kinase | ATP, ADP, 6-phospho-D-gluconate |
| Single domain proteins | Shikimate kinase (AroK) | ADP, shikimate |
| Single domain proteins | Extended AAA-ATPase domain | ATP, ADP, dATP, dCTP, dGTP, deoxynucleoside triphosphate |
| Single domain proteins | Adenosine-5′-phosphosulfate kinase (APS kinase) | ADP, adenylylselenate, adenylylsulfate |
| Single domain proteins | PAPS sulfotransferase | Adenosine 3′,5′-bisphosphate, 3β-hydroxyandrost-5-en-17-one, 3′-phosphoadenylyl sulphate, alkyl sulphate, estrone, phenol |
| Single domain proteins | Phosphoribulokinase/pantothenate kinase | ADP, GDP, CMP, UTP, UDP, UMP, dADP, dCTP, dGDP, dTTP, dUTP, IDP |
| Single domain proteins | G proteins | ATP, ADP, AMP, GMP |
Rows are grouped according to domain architecture and family. Family refers to the sub-grouping of proteins within a SCOP superfamily; all family names are taken from SCOP. Where the domain architecture column is blank, proteins are single domain only. Domain architecture is listed sequentially N terminus to C terminus in the second column as the SCOP superfamily code for each domain; where inserted and discontinuous domains are present, numbers prefixed to the superfamily code indicate the different sequential sections of domains. Only ligands bound to a domain in a non-shared manner (see the text) were counted. Additionally, all ligands must have at least three contacts to the protein, and the cognate ligand must have a graph-matching score of greater than 0.5 to the PDB ligand.
partner domain of the dehydrogenase in the multi-domain examples. Thus, here it can be seen how a switch from multiple-domain to single-domain architecture corresponds to a change in ligand specificity. In general, in the two previous examples, the Rossmann domain is responsible for binding the cofactor NAD or NADP, and the P-loop binds a nucleotide or similar compound, regardless of the domain architectural context. In the next example, the class II aaRS and biotin synthetase domain, specific domain combinations are associated with different ligand specificities. This domain is found as the catalytic domain in various tRNA synthetase enzymes (which catalyse the production of aminoacyl-tRNA molecules needed for translation), biotin synthetase (which catalyses the synthesis of biotinyl-5′-adenylate from biotin and ATP), and asparagine synthetase (which catalyses the synthesis of asparagine from aspartate and ATP). Table 4 shows the substrate specificity of the class II aaRS and biotin synthetase domain in conjunction with the different domain architectures. With different domain partners, the class II aaRS and biotin synthetase domains have different specificities for amino acid substrates. For the class II amino acid tRNA synthetase enzymes, this is because different superfamilies of partner domain assist the catalytic domain in binding different tRNA anticodons, with the catalytic domain changing its amino acid substrate specificity accordingly. This example shows how knowledge of domain architectures and substrate specificity could be useful in function prediction for some superfamilies. For instance, it is clear from Table 4 that a
Table 3. The domain context and substrate specificity of Rossmann domains

| Domain architecture | Family | Cognate ligands |
|---|---|---|
| 1:b.35.1, c.2.1, 2:b.35.1 | Alcohol dehydrogenase-like, C-terminal domain | NAD+, NADH, NADP+, NADPH, glycerol |
| 1:c.2.1, 1:d.81.1, 2:c.2.1, 2:d.81.1 | Glyceraldehyde-3-phosphate dehydrogenase-like, N-terminal domain | NADP+, NADPH |
| 1:c.2.1, d.81.1, 2:c.2.1 | Glyceraldehyde-3-phosphate dehydrogenase-like, N-terminal domain | NAD+, NADH, NADP+, NADPH |
| 1:c.2.1, d.81.1, 2:c.2.1, a.69.3 | Glyceraldehyde-3-phosphate dehydrogenase-like, N-terminal domain | NADPH |
| 1:c.23.12, c.2.1, 2:c.23.12 | Formate/glycerate dehydrogenases, NAD-domain | NAD+, NADP+, pyruvate |
| 1:c.23.12, c.2.1, 2:c.23.12, d.58.18 | Formate/glycerate dehydrogenases, NAD-domain | NAD+ |
| c.2.1, a.100.1 | 6-Phosphogluconate dehydrogenase-like, N-terminal domain | NAD+, NADH, NADPH, NADP+ |
| c.2.1, a.100.1, c.26.3 | 6-Phosphogluconate dehydrogenase-like, N-terminal domain | NAD+, NADH |
| c.2.1, c.23.4 | CoA-binding domain | CoA |
| c.2.1, d.162.1 | LDH N-terminal domain-like | NAD+, NADH, NADP+ |
| c.2.1, e.37.1 | Siroheme synthase N-terminal domain-like | NAD+ |
| c.2.1, e.37.1, c.90.1 | Siroheme synthase N-terminal domain-like | NAD+ |
| c.58.1, c.2.1 | Amino acid dehydrogenase-like, C-terminal domain | NAD+, NADH, NADP+, NADPH, ADP |
| Single domain proteins | Tyrosine-dependent oxidoreductases (short-chain dehydrogenases) | NAD+, NADH, NADP+, NADPH, ADP-D-glycero-D-manno-heptose, ADP-L-glycero-D-manno-heptose, dTDPglucose, GDPmannose, GDP-4-dehydro-6-deoxy-D-mannose, UDPglucose, UDP-6-sulfoquinovose, UDP-D-galactose, 4,6-dideoxy-4-oxo-dTDP-D-glucose, dTDPgalactose, dTDP-6-deoxy-L-mannose, dTDP-4-dehydro-6-deoxy-alpha-D-glucose, estrone, estriol, estradiol-17β, androstan-3α,17β-diol, 16α-hydroxyestrone, 17β-hydroxyandrostan-3-one, 3α,7α,12α-trihydroxy-5β-cholanate, HSO3−, biliverdin, vermelone, (S)-methylmalonate semialdehyde, glycerol, D-glyceraldehyde, pseudotropine, reduced riboflavin, riboflavin, scytalone, tropine, tropinone |
Rows are grouped according to domain architecture and family. See the legend to Table 2.
domain architecture of a.4.5, d.104.1, b.34.1 is indicative of biotin binding in the biotin repressor, a.4.5 being a winged helix DNA-binding domain. The architecture b.40.4, d.104.1 is indicative of lysyl-tRNA synthetase; d.104.1, c.51.1 is indicative of histidyl-tRNA and threonyl-tRNA synthetases; d.15.10, d.67.1, d.104.1, c.51.1 is also indicative of threonyl-tRNA synthetase; and a single d.104.1 domain is indicative of asparagine synthetase. We are currently using our dataset as a starting point to carry out further investigations as to how domain architecture affects substrate specificity.
Table 4. The domain context and substrate specificity of the class II aaRS and biotin synthetase domain

| Domain architecture | Family | Cognate ligands |
|---|---|---|
| 1:a.6.1, b.40.4, 2:a.6.1, d.138.1, a.6.1, d.104.1, d.58.13 | Class II aminoacyl-tRNA synthetase (aaRS)-like, catalytic domain | AMP |
| a.2.7, d.104.1 | Class II aminoacyl-tRNA synthetase (aaRS)-like, catalytic domain | AMP |
| a.4.5, d.104.1, b.34.1 | Biotin holoenzyme synthetase | Biotin |
| b.40.4, 1:d.104.1, d.74.4, 2:d.104.1 | Class II aminoacyl-tRNA synthetase (aaRS)-like, catalytic domain | AMP |
| b.40.4, d.104.1 | Class II aminoacyl-tRNA synthetase (aaRS)-like, catalytic domain | ATP, AMP, L-lysine, pyrophosphate |
| d.104.1, c.51.1 | Class II aminoacyl-tRNA synthetase (aaRS)-like, catalytic domain | ATP, AMP, L-histidine, L-threonine |
| d.104.1, c.51.1, d.68.5 | Class II aminoacyl-tRNA synthetase (aaRS)-like, catalytic domain | AMP |
| d.15.10, d.67.1, d.104.1, c.51.1 | Class II aminoacyl-tRNA synthetase (aaRS)-like, catalytic domain | ATP, AMP, L-threonine |
| Single domain proteins | Class II aminoacyl-tRNA synthetase (aaRS)-like, catalytic domain | AMP, L-asparagine |

Rows are grouped according to domain architecture and family. See the legend to Table 2.

Domains that bind the most chemically similar and different ligands

Using hashed fingerprints to represent all ligands available for a superfamily, it is possible to compare the chemical similarity of all ligands bound by a particular superfamily. By doing this, we can assess which superfamilies are able to bind the most chemically diverse or most similar range of ligands. For each superfamily that binds more than four different ligands in a non-shared manner, we calculated the average ligand similarity by doing an all-against-all ligand comparison. We require at least four ligands to be bound, as this generally filters out cases of a single enzyme in a superfamily binding one substrate, one product and one cofactor which, rather than being an inter-enzyme comparison of substrate specificity, is an intra-enzyme comparison. Some of the superfamilies exhibit high average ligand similarities among their cognate ligands, whereas other superfamilies show very low average similarities, corresponding to a diversity of structural scaffolds amongst their cognate ligands. Five examples of each type are shown in Figure 7(a) and (b).
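The all-against-all comparison can be illustrated with a minimal Python sketch. It assumes each hashed fingerprint is represented as a set of set-bit positions; the function names, the toy fingerprints and the `min_ligands=4` default are illustrative choices, not taken from the PROCOGNATE code.

```python
from itertools import combinations


def tanimoto(bits_a, bits_b):
    """Tanimoto score between two fingerprints given as sets of set-bit positions."""
    common = len(bits_a & bits_b)
    union = len(bits_a) + len(bits_b) - common
    return common / union if union else 0.0


def average_ligand_similarity(fingerprints, min_ligands=4):
    """Mean Tanimoto score over all distinct ligand pairs of one superfamily.

    Returns None when fewer than min_ligands ligands are supplied
    (the threshold value is an assumption made for this sketch).
    """
    if len(fingerprints) < min_ligands:
        return None
    scores = [tanimoto(a, b) for a, b in combinations(fingerprints, 2)]
    return sum(scores) / len(scores)


# Toy fingerprints (hypothetical bit positions, not real CDK output).
similar = [{1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 3, 4, 5}, {1, 2, 3}]
diverse = [{1, 2}, {3, 4}, {5, 6}, {7, 8}]

print(average_ligand_similarity(similar))
print(average_ligand_similarity(diverse))
```

A superfamily whose ligands share most of their set bits gives a high average, while one whose ligands share none gives zero, which is the contrast drawn between the two groups of superfamilies.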
Conclusion

The recognition of small molecules by proteins is of fundamental importance in understanding protein function and the role of proteins in the wider cellular context of pathways and processes. It has also become very clear that domains are the evolutionary and functional units of proteins, directing protein function, and the genesis of new proteins with new functions. To this end, we have automatically identified cognate ligands for enzymes in the PDB and associated these ligands with their binding domains. This analysis and the associated data resource will be of help in understanding the principles of protein function and evolution with regard to small molecule binding. Subsequent studies will investigate the role of domain combinations in substrate specificity of a protein. We hope that others will find the database useful for protein function prediction and as an aid to help protein crystallisation by suggesting possible ligands for uncharacterised proteins belonging to particular superfamilies. In a comprehensive review, Ma and co-workers highlighted the issue that a specific site in a protein may in fact be capable of binding a wider variety of
ligands than traditionally thought.37 We agree with this, and acknowledge that in vivo an enzyme could bind a much wider range of ligands than found in the PDB. Here, we have restricted the similarity search to ligands known to occur in the documented reaction of each enzyme, using EC numbers and KEGG reactions. Thus, one caveat of our work is that it is restricted to the known reactions available in public databases for these enzymes, and does not include all possible ligands. For any analysis of protein–ligand interactions, especially attempts to predict the cognate ligand for a new structure of unknown function, it is important to use a benchmark test including only biologically relevant ligands. This database will facilitate the creation of such benchmark test sets.
Figure 7. (a) The cognate ligands of five superfamilies each exhibiting high average ligand similarities and (b) five superfamilies exhibiting very low average ligand similarity, calculated for their corresponding sets of cognate ligands. These are shown in a two-dimensional plot, where similar ligands are expected to lie close in space. The plot is produced using multidimensional scaling (in two dimensions) of the dissimilarity matrix produced using hashed fingerprint-based Tanimoto scores. In this space, ligands with similar connectivity are expected to be closer in space than ligands of very different connectivities. Each plot contains 1307 small molecules, 1193 of which are PDB ligands from the MSD and are used to create the background of the plot (in grey) and 114 are cognate ligands for the ten superfamilies showcased in these plots. The colours correspond to the following SCOP superfamilies: in plot (a) blue = e.6.1 acyl-CoA dehydrogenase NM domain-like, red = d.17.4 NTF2-like, orange = c.1.15 xylose isomerase-like, green = c.80.1 SIS domain, purple = a.100.1 6-phosphogluconate dehydrogenase C-terminal domain-like. In plot (b) the colours correspond to the following SCOP superfamilies: blue = c.94.1 periplasmic binding protein-like II, red = c.69.1 Fe-only hydrogenase, orange = d.104.1 class II aaRS and biotin synthetases, green = a.93.1 heme-dependent peroxidases, purple = c.1.9 metallo-dependent hydrolases.

Methods

Protein–ligand contact information

Protein–ligand contact data for this study are taken from the Macromolecular Structure Database (MSD)6 search database (MSDSD), allowing us to count the number of contacts made between a ligand and each domain in SCOP3 and CATH.4 The contact data include hydrogen bonds, van der Waals interactions, ionic and covalent bonds, and aromatic ring interactions extracted from the most-probable (PQS-like) biological assembly of protein structure,26 thus including biologically relevant protein–ligand contacts missing from the asymmetric unit as deposited in the PDB.

Domain ligand assignment

The MSDSD also provides the assignment of each residue in the chain to the appropriate SCOP and CATH protein domains. Using these mappings and the protein–ligand contact data, we retrieve the total number of contacts made to any one ligand by the whole structural assembly and each domain in each chain. Many ligands contact more than one domain. To handle this, if any one domain has ≥ 75% of the total contacts to a particular ligand, then the binding of that ligand is assigned to that domain, and the mode of binding is recorded as non-shared. If no single domain has ≥ 75% of the contacts to any one ligand, then all domains that contact that ligand will be recorded as binding that particular ligand in a shared manner. A cut-off of 75% was chosen after careful and detailed manual examination of ligand–domain interactions in several different protein structures that were representative of different scenarios. It is a good compromise between being too lenient, leading to the situation where even marginal contacts lead to all domains of a structure binding a ligand, and too strict, so that significant contributions of different domains are ignored. The resultant domain–ligand assignments for both the SCOP and CATH databases generated by the procedure are stored in an internal MySQL database.

Identifying cognate ligands

The basis of the identification of the cognate ligands is the assignment of EC numbers to the proteins in the PDB.
An EC number describes the function of the protein in a hierarchical manner, starting from the very generic level of function (e.g. 1 = oxidoreductase) and ending with the substrate specificity (e.g. 1.1.1.9 corresponds to D-xylulose reductase, where the substrate is D-xylulose and the product is xylitol). There are complications, in that different levels have different meanings according to the primary class, and the last digit can also be generic, e.g. 1.1.1.1 represents alcohol dehydrogenases, without specifying the alcohol. However, this is still the most reliable way of automatically assigning possible substrates and products to a protein, without the need to consult the literature individually for each enzyme, although of course this restricts the assignment to enzymes. With 13,217 enzymes with EC assignments currently in the MSD database, it is easy to see the advantage of having a non-manual way of performing the substrate assignments.

From the PDB code to the cognate ligands of the relevant protein

We extracted the EC numbers from the MSD, rather than directly from the PDB file. This is beneficial, as the MSD obtains its EC numbers via a careful sequence mapping to the UniProt protein sequence database,38 which is a high-quality, semi-manually-annotated resource.
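As an illustration of the domain ligand assignment rule described earlier in the Methods (a ligand is non-shared when one domain makes at least 75% of its contacts, otherwise it is shared by all contacting domains), here is a minimal Python sketch. The function name, the dictionary input and the domain labels are hypothetical; only the 75% cut-off and the 19-of-21 contact count for NAD in Figure 6(a) come from the text, and attributing the remaining two contacts to the d.81.1 domain is an assumption.

```python
def assign_binding(contacts_by_domain, threshold=0.75):
    """Classify how one ligand is bound.

    contacts_by_domain maps a domain identifier to the number of contacts
    that domain makes with the ligand. Returns (mode, domains), where mode
    is "non-shared", "shared" or "unbound".
    """
    contacting = {d: n for d, n in contacts_by_domain.items() if n > 0}
    total = sum(contacting.values())
    if total == 0:
        return ("unbound", [])
    for domain, n in contacting.items():
        if n / total >= threshold:
            # One domain supplies at least 75% of the contacts.
            return ("non-shared", [domain])
    # No domain reaches the cut-off: every contacting domain shares the ligand.
    return ("shared", sorted(contacting))


# NAD in glyceraldehyde-3-phosphate dehydrogenase: 19 of 21 contacts
# come from the Rossmann domain (Figure 6(a)); labels are illustrative.
print(assign_binding({"c.2.1": 19, "d.81.1": 2}))
# A ligand contacted equally by two chains (hypothetical counts).
print(assign_binding({"chain A": 10, "chain B": 10}))
```

Because at most one domain can hold 75% or more of the contacts, the order in which domains are examined does not affect the result.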
To associate possible reactants and products to this EC number, the “reaction” file from the KEGG database is used. This provides the link between the EC numbers, the KEGG reaction IDs and the small molecule IDs involved in these reactions (extracted from the reaction equation in this file). Some EC numbers are associated with cofactors, which are not always present in KEGG. If the cofactor is seen as a reactant (e.g. NAD or ATP), then it is most likely listed. However, we have found that it is common for metals acting as cofactors not to be listed in KEGG reaction equations. To cover these cases, we have additionally used information from the ENZYME database (release 37.0, March 2005), the official database for the nomenclature of enzymes. The reason we do not use this database to map EC numbers to substrates and products is that small molecules in this database are simply identified by their names, and chemical names are notoriously difficult to map accurately. In the case of the cofactors, and as we have no alternative source of information, we perform name matching between names appearing in the ENZYME database (CF) field and names appearing in the KEGG Compound database (using all available synonyms). We combine the information obtained from KEGG and that obtained from the ENZYME database and derive a table where each EC number is associated with a small molecule in a particular reaction.
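The EC number to small molecule table can be sketched in Python as follows. This is only an illustration: it assumes a simplified flat-file layout with `ENTRY`, `EQUATION` and `ENZYME` fields and `///` record separators, and the sample records use toy identifiers that have not been checked against KEGG. Cofactor name matching against the ENZYME database is not covered.

```python
import re
from collections import defaultdict

# Toy records in an assumed, simplified layout (identifiers are illustrative).
SAMPLE = """\
ENTRY       R90001
EQUATION    C90001 + C90002 <=> C90003 + C90004
ENZYME      1.1.1.1         1.1.1.71
///
ENTRY       R90002
EQUATION    C90005 + C90002 <=> C90006 + C90004
ENZYME      1.1.1.9
///
"""


def ec_to_compounds(text):
    """Map each EC number to the set of (reaction id, compound id) pairs."""
    table = defaultdict(set)
    for record in text.split("///"):
        reaction_id, compounds, ec_numbers = None, [], []
        for line in record.splitlines():
            key, _, value = line.partition(" ")
            if key == "ENTRY":
                reaction_id = value.split()[0]
            elif key == "EQUATION":
                # Compound identifiers are taken from the reaction equation.
                compounds = re.findall(r"C\d{5}", value)
            elif key == "ENZYME":
                ec_numbers = value.split()
        if reaction_id is None:
            continue  # blank trailing record
        for ec in ec_numbers:
            for compound in compounds:
                table[ec].add((reaction_id, compound))
    return table


table = ec_to_compounds(SAMPLE)
for ec in sorted(table):
    print(ec, sorted(table[ec]))
```

Keeping the reaction identifier alongside each compound preserves the association of each small molecule with a particular reaction, as in the table described above.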
Obtaining the PDB ligands for a given complex

We use the MSDSD database to extract all bound small molecule ligands, excluding water, proteins and nucleic acids. Each ligand is identified uniquely and information on its chemical identity is extracted where available. In the MSD database, each ligand is mapped to a library of small molecules, and if the mapping is successful, the ligand is assigned an ID (chem_comp_id), which identifies it uniquely as a chemical entity (e.g. all ATP molecules in the MSD database are assigned the same ID, 794). This ID allows us to retrieve the connectivity information for all ligands that have been mapped to the chemical library.

Comparing ligand structures

To test if the PDB ligands are cognate or cognate-like, we need to compare them to the EC and KEGG-defined ligands and co-factors. There are two general ways to compare 2D structures of molecules: one method treats the molecules as graphs and uses graph-matching algorithms; the other uses a hashed fingerprint description of the molecules (i.e. a series of binary or real numbers that may encode structural or physicochemical properties of the molecules) and compares the hashed fingerprints. The latter is faster and easier to implement, but can lead to inaccuracies, i.e. two molecules may be seen as identical, even though they are not. The graph-matching approach does not suffer from this problem, but the complexity of the algorithms involved means that it is not always possible to obtain an answer in a reasonable amount of time. As we are interested in knowing which parts of the two ligands have been matched, we use the graph-matching approach, as implemented in the Chemistry Development Kit (CDK) open source software.28 A caveat of this approach is that the largest common (and connected) subgraph between two molecules is not necessarily biologically meaningful, but we make the assumption that in the majority of cases it will be a reasonable choice. For instance, the common scaffold between two molecules may not include the active part of a ligand containing the group(s) involved in the reaction mechanism, e.g. the adenine ring of NAD may match up with that of an NAD analogue that lacks the active nicotinamide ring.

In order to perform graph matching and to be able to identify the matched atoms/bonds, we need to use a molecular format whereby atoms can be identified uniquely (SMILES strings, for example, are not suitable for this procedure). We use the MDL mol format, as this is the format used by the KEGG database. PDB ligands are not available directly in this format, but the MSD database contains information in the CHEM_ATOM and CHEM_BOND tables that allows us to generate these files. Although an all-by-all comparison between PDB ligands and KEGG small molecules would be useful, here we limit the comparisons to the small molecules found in the assigned EC number(s) with the PDB ligands for the given structure. Similarity comparisons were performed using our own Java program, which makes use of the CDK library. This produces a score based on the largest common substructure found, and a list of atom pairs that have been mapped (currently only one list is kept, although many alternative mappings of the same size could be found). Having obtained the largest common substructure for molecules A and B, we calculate the Tanimoto-like score, S, as:

$$S = \frac{N_{\mathrm{sub}}}{N_A + N_B - N_{\mathrm{sub}}}$$

where $N_{\mathrm{sub}}$ is the number of atoms in the maximum common substructure, $N_A$ is the number of atoms of molecule A and $N_B$ is the number of atoms in molecule B. For purposes of comparison, we have calculated hashed fingerprint-based scores using the algorithm implemented in the CDK library. The Tanimoto score between any two fingerprints is then given by the equation:

$$T = \frac{\mathrm{Bits}_{\mathrm{common}}}{\mathrm{Bits}_A + \mathrm{Bits}_B - \mathrm{Bits}_{\mathrm{common}}}$$

where $T$ is the Tanimoto score, $\mathrm{Bits}_A$ is the number of bits set in fingerprint A, $\mathrm{Bits}_B$ is the number of bits set in fingerprint B, and $\mathrm{Bits}_{\mathrm{common}}$ is the number of bits set in both A and B fingerprints. Hashed fingerprints of 1088 bits for all PDB ligands and cognate ligands were created using the CDK Fingerprinter class.

Graph and fingerprint scores

Similarity scores associated with the PDB ligand–cognate ligand pairs compared in the database do not have an absolute meaning, i.e. one cannot associate a given value of the score with an explicitly defined and generally agreed upon value of similarity (this problem is not specific to our way of calculating similarity; there simply does not exist a unique way of defining molecular similarity). For comparison, we calculated both hashed fingerprint and graph-matching scores for a random sample of 1190 pairs of molecules, where one molecule comes from the KEGG database and the other comes from the MSD ligand database. We find that approximately 99% of all random graph-matching scores are equal to or less than 0.5; hence, we can safely consider values higher than that as significant. The fingerprint scores are also generally low, with only about 0.6% of all random scores being higher than 0.5.

Cut-offs used to filter protein-ligand contacts and cognate ligand assignment

We choose to use two cut-off values to facilitate analysis of the data. Although clearly the number of contacts
between a protein and ligand is a continuum, the first cut-off is that all ligands must have at least three contacts to the protein. This eliminates peripheral ligands that, due to peculiarities of the asymmetric unit or the symmetry transformations used in the quaternary assembly, make only tenuous contacts to the protein. The second is that in order for a PDB ligand to qualify as cognate-like, it must match one of the enzyme's cognate KEGG ligands from the relevant KEGG reactions with a graph-matching score greater than 0.5 to ensure a good chemical match.

Cognate ligand assignment

Each ligand associated with a domain in the ‘domain–ligand assignment’ is considered in turn with the information in the cognate ligand mapping table. For each EC number and KEGG reaction_id pair associated with a structure, the highest scoring cognate ligand match to each PDB ligand is recorded. These tables associate domains of the SCOP and CATH databases with their likely cognate ligands. If two or more likely cognate ligands have the same graph-matching score, then both are recorded as likely cognate ligands for that domain.

Binding site characterisation (shared and non-shared binding)

The assessment of shared and non-shared ligand binding (Figure 5) was produced from a list of probable cognate ligands found in SCOP domains either flagged as being bound in a shared or non-shared way, with a graph-matching score of greater than 0.5 and with a minimum requirement of three contacts for each ligand in the whole assembly. These values ensure that only good matches are used for the calculation, and that ligands bound on the outside of assemblies are not counted. There are a total of 724 cognate ligands in the SCOP domain mapping where the graph-matching score is greater than 0.5, and there are at least three contacts to each ligand in an assembly.

PROCOGNATE flat file

The cognate ligand mappings for both the SCOP and CATH domain classifications generated in this study are freely available†.
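The graph-matching score and the two cut-offs described in the Methods (at least three contacts, graph-matching score greater than 0.5, ties between equally scoring cognate ligands kept) can be sketched in Python as follows. The function names and inputs are hypothetical, and the sketch takes the size of the maximum common substructure as given; it does not perform graph matching, which the study carried out with a Java program built on the CDK.

```python
def graph_score(n_sub, n_a, n_b):
    """Tanimoto-like score S = N_sub / (N_A + N_B - N_sub)."""
    return n_sub / (n_a + n_b - n_sub)


def best_cognate_matches(scores, min_score=0.5):
    """Return the highest-scoring cognate ligand(s) for one PDB ligand.

    scores maps a candidate cognate ligand identifier to its graph-matching
    score. All candidates sharing the top score are returned; an empty list
    means no candidate exceeds min_score.
    """
    if not scores:
        return []
    best = max(scores.values())
    if best <= min_score:
        return []
    return sorted(c for c, s in scores.items() if s == best)


def is_cognate_like(n_contacts, scores, min_contacts=3, min_score=0.5):
    """Apply both cut-offs: enough contacts and a good chemical match."""
    return n_contacts >= min_contacts and bool(best_cognate_matches(scores, min_score))


# Hypothetical example: a 10-atom PDB ligand fully contained in a 15-atom cognate ligand.
print(graph_score(10, 10, 15))
print(best_cognate_matches({"X": 0.8, "Y": 0.8, "Z": 0.3}))
print(is_cognate_like(2, {"X": 0.8}))
```

Keeping tied candidates, rather than picking one arbitrarily, mirrors the rule that equally scoring cognate ligands are all recorded for the domain.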
Acknowledgements

We acknowledge Adel Golovin, Dimitris Dimitropoulos, Peter Keller and Tom Oldfield, the members of the MSD group at the EBI who gave us some very useful help and support regarding the MSDSD. M.B. was supported by NIH grant (GM 62414) and by the US DOE under contract (W-31-109-ENG38); I.N. was funded by a Special Training Fellowship in Bioinformatics from the Medical Research Council. EBI thanks IBM for the use of an IBM eServer Blade Center in its research work. We thank Gabrielle Reeves for helpful comments and suggestions on the manuscript.

† http://www.ebi.ac.uk/thornton-srv/databases/procognate/flat_file.html
References

1. Rossmann, M. G., Moras, D. & Olsen, K. W. (1974). Chemical and biological evolution of nucleotide-binding protein. Nature, 250, 194–199.
2. Ohlsson, I., Nordstrom, B. & Branden, C. I. (1974). Structural and functional similarities within the coenzyme binding domains of dehydrogenases. J. Mol. Biol. 89, 339–354.
3. Murzin, A. G., Brenner, S. E., Hubbard, T. & Chothia, C. (1995). SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247, 536–540.
4. Orengo, C. A., Michie, A. D., Jones, S., Jones, D. T., Swindells, M. B. & Thornton, J. M. (1997). CATH–a hierarchic classification of protein domain structures. Structure, 5, 1093–1108.
5. Berman, H., Henrick, K. & Nakamura, H. (2003). Announcing the worldwide Protein Data Bank. Nature Struct. Biol. 10, 980.
6. Velankar, S., McNeil, P., Mittard-Runte, V., Suarez, A., Barrell, D., Apweiler, R. & Henrick, K. (2005). E-MSD: an integrated data resource for bioinformatics. Nucl. Acids Res. 33, D262–D265.
7. Bairoch, A. (2000). The ENZYME database in 2000. Nucl. Acids Res. 28, 304–305.
8. Ogata, H., Goto, S., Sato, K., Fujibuchi, W., Bono, H. & Kanehisa, M. (1999). KEGG: Kyoto encyclopedia of genes and genomes. Nucl. Acids Res. 27, 29–34.
9. Schulz, G. E. & Schirmer, R. H. (1974). Topological comparison of adenyl kinase with other proteins. Nature, 250, 142–144.
10. Schulz, G. E. (1980). Gene duplication in glutathione reductase. J. Mol. Biol. 138, 335–347.
11. McKie, J. H. & Douglas, K. T. (1991). Evidence for gene duplication forming similar binding folds for NAD(P)H and FAD in pyridine nucleotide-dependent flavoenzymes. FEBS Letters, 279, 5–8.
12. Walker, J. E., Saraste, M., Runswick, M. J. & Gay, N. J. (1982). Distantly related sequences in the alpha- and beta-subunits of ATP synthase, myosin, kinases and other ATP-requiring enzymes and a common nucleotide binding fold. EMBO J. 1, 945–951.
13. Kobayashi, N. & Go, N. (1997). ATP binding proteins with different folds share a common ATP-binding structural motif. Nature Struct. Biol. 4, 6–7.
14. Denessiouk, K. A. & Johnson, M. S. (2000). When fold is not important: a common structural framework for adenine and AMP binding in 12 unrelated protein families. Proteins: Struct. Funct. Genet. 38, 310–326.
15. Schulz, G. (1992). Binding of nucleotides by proteins. Curr. Opin. Struct. Biol. 2, 61–67.
16. Traut, T. W. (1994). The functions and consensus motifs of nine types of peptide segments that form different types of nucleotide-binding sites. Eur. J. Biochem. 222, 9–19.
17. Moodie, S. L., Mitchell, J. B. & Thornton, J. M. (1996). Protein recognition of adenylate: an example of a fuzzy recognition template. J. Mol. Biol. 263, 486–500.
18. Denessiouk, K. A., Rantanen, V. V. & Johnson, M. S. (2001). Adenine recognition: a motif present in ATP-, CoA-, NAD-, NADP-, and FAD-dependent proteins. Proteins: Struct. Funct. Genet. 44, 282–291.
19. Chalk, A. J., Worth, C. L., Overington, J. P. & Chan, A. W. (2004). PDBLIG: classification of small molecular protein binding in the Protein Data Bank. J. Med. Chem. 47, 3807–3816.
20. Golovin, A., Dimitropoulos, D., Oldfield, T., Rachedi, A. & Henrick, K. (2005). MSDsite: a database search and retrieval system for the analysis and viewing of bound ligands and active sites. Proteins: Struct. Funct. Genet. 58, 190–199.
21. Laskowski, R. A., Chistyakov, V. V. & Thornton, J. M. (2005). PDBsum more: new summaries and analyses of the known 3D structures of proteins and nucleic acids. Nucl. Acids Res. 33, D266–D268.
22. Hendlich, M. (1998). Databases for protein-ligand complexes. Acta Crystallog. sect. D, 54, 1178–1182.
23. Feng, Z., Chen, L., Maddula, H., Akcan, O., Oughtred, R., Berman, H. M. & Westbrook, J. (2004). Ligand Depot: a data warehouse for ligands bound to macromolecules. Bioinformatics, 20, 2153–2155.
24. Alfarano, C., Andrade, C. E., Anthony, K., Bahroos, N., Bajec, M., Bantoft, K. et al. (2005). The biomolecular interaction network database and related tools 2005 update. Nucl. Acids Res. 33, D418–D424.
25. Gold, N. D. & Jackson, R. M. (2006). SitesBase: a database for structure-based protein-ligand binding site comparisons. Nucl. Acids Res. 34, D231–D234.
26. Henrick, K. & Thornton, J. M. (1998). PQS: a protein quaternary structure file server. Trends Biochem. Sci. 23, 358–361.
27. Halperin, I., Ma, B., Wolfson, H. & Nussinov, R. (2002). Principles of docking: an overview of search algorithms and a guide to scoring functions. Proteins: Struct. Funct. Genet. 47, 409–443.
28. Steinbeck, C., Han, Y., Kuhn, S., Horlacher, O., Luttmann, E. & Willighagen, E. (2003). The Chemistry Development Kit (CDK): an open-source Java library for chemo- and bioinformatics. J. Chem. Inf. Comput. Sci. 43, 493–500.
29. Abad-Zapatero, C., Griffith, J. P., Sussman, J. L. & Rossmann, M. G. (1987). Refined crystal structure of dogfish M4 apo-lactate dehydrogenase. J. Mol. Biol. 198, 445–467.
30. Nobeli, I., Spriggs, R. V., George, R. A. & Thornton, J. M. (2005). A ligand-centric analysis of the diversity and evolution of protein-ligand relationships in E. coli. J. Mol. Biol. 347, 415–436.
31. Apic, G., Gough, J. & Teichmann, S. A. (2001). Domain combinations in archaeal, eubacterial and eukaryotic proteomes. J. Mol. Biol. 310, 311–325.
32. Park, J., Lappe, M. & Teichmann, S. A. (2001). Mapping protein family interactions: intramolecular and intermolecular protein family interaction repertoires in the PDB and yeast. J. Mol. Biol. 307, 929–938.
33. Saraste, M., Sibbald, P. R. & Wittinghofer, A. (1990). The P-loop–a common motif in ATP- and GTP-binding proteins. Trends Biochem. Sci. 15, 430–434.
34. Bashton, M. & Chothia, C. (2002). The geometry of domain combination in proteins. J. Mol. Biol. 315, 927–939.
35. Persson, B., Kallberg, Y., Oppermann, U. & Jornvall, H. (2003). Coenzyme-based functional assignments of short-chain dehydrogenases/reductases (SDRs). Chem. Biol. Interact. 143–144, 271–278.
36. Nagano, N., Orengo, C. A. & Thornton, J. M. (2002). One fold with many functions: the evolutionary relationships between TIM barrel families based on their sequences, structures and functions. J. Mol. Biol. 321, 741–765.
37. Ma, B., Shatsky, M., Wolfson, H. J. & Nussinov, R. (2002). Multiple diverse ligands binding at a single protein site: a matter of pre-existing populations. Protein Sci. 11, 184–197.
38. Bairoch, A., Apweiler, R., Wu, C. H., Barker, W. C., Boeckmann, B., Ferro, S. et al. (2005). The Universal Protein Resource (UniProt). Nucl. Acids Res. 33, D154–D159.
39. Vellieux, F. M., Hajdu, J., Verlinde, C. L., Groendijk, H., Read, R. J., Greenhough, T. J. et al. (1993). Structure of glycosomal glyceraldehyde-3-phosphate dehydrogenase from Trypanosoma brucei determined from Laue data. Proc. Natl Acad. Sci. USA, 90, 2355–2359.
40. Wang, X. G., Olsen, L. R. & Roderick, S. L. (2002). Structure of the lac operon galactoside acetyltransferase. Structure (Camb), 10, 581–588.
Edited by M. Sternberg (Received 12 June 2006; received in revised form 12 September 2006; accepted 15 September 2006) Available online 20 September 2006