The Geometry of Domain Combination in Proteins

doi:10.1006/jmbi.2001.5288, available online at http://www.idealibrary.com

J. Mol. Biol. (2002) 315, 927-939

Matthew Bashton* and Cyrus Chothia

MRC Laboratory of Molecular Biology, Hills Road, Cambridge CB2 2QH, England

Most proteins in genomes are the result of the recombination of two or more domains. It has been found that if proteins are formed by a combination of domains from superfamilies A and B, then the domains may occur in the sequential order AB or BA, but only in about 2% of cases do they occur in both sequential orders. The classical Rossmann domains of known structure are combined with catalytic domains from seven different superfamilies. In addition, there are eight cases where structures with both AB and BA domain combinations are known. For these two sets of structures, we analysed: (i) the relative orientation of the domains; (ii) the type of domain connection; (iii) the structure of the interdomain links; and (iv) domain function. The results of this analysis indicate that in most cases domain order is conserved because recombination of the domains has only occurred once during the course of evolution. Functional reasons become important when the domain connections are short. In seven out of the eight known cases where domains are combined in the AB and BA sequential orders, they have different geometrical relationships that give them different functional properties.

© 2002 Academic Press

*Corresponding author

Keywords: evolution; Rossmann domains; catalytic domains; cyclic permutation of genes

Abbreviations used: NAD+, nicotinamide adenine dinucleotide; FAD, flavin adenine dinucleotide; SCOP, Structural Classification of Proteins.

E-mail address of the corresponding author: [email protected]

Introduction

Examination of the early protein structures showed that many are formed by the recombination of domains.¹ Many of these domains have homologues that occur in combination with different partners, in isolation or in both states. Analysis of the genome sequence of Mycoplasma genitalium indicates that some two-thirds of its proteins are built by combinations of two or more domains. Other prokaryotes are likely to have the same proportion, and in eukaryotes it is likely to be somewhat higher.²,³

The pairwise combinations of domains that occur in the proteins of known structure and those from genome projects have been analysed by Park et al.⁴ and by Apic et al.⁵ These analyses determined the extent to which the members of different protein superfamilies combine with one another and in what sequential order. They found that in the superfamilies that make combinations, most do so with just one or two other families. Conversely, there are few superfamilies that make combinations with a larger number of other families. One of these is the superfamily of Rossmann domains that bind the coenzyme nicotinamide adenine dinucleotide (NAD+). They also found that particular pairs of superfamilies almost always combine in the same sequential order. Thus, if proteins from superfamilies A and B are found in combination they may be in the sequential order AB or BA, but very rarely in both combinations.⁵

Two explanations for this conservation of order could be that (i) it is historical: the recombination event occurred once and current proteins are the descendants of the original pair; and/or (ii) it is a functional requirement: the observed combination order is required for formation of the active site.

Here, we examine the adequacy of such explanations by an analysis of the connections and domain arrangements in two sets of proteins. The first set consists of the classical Rossmann domains and their different catalytic partners. The second set consists of the eight sets of proteins that were identified by Apic et al.⁵ as having domains linked in both sequential orders, i.e. AB and BA.

We first describe the Rossmann domains; the different types of connections that they make to catalytic domains; and the extent to which these connections are conserved and their relation (or lack of it) to function. Secondly, we describe briefly similar work for the second set of proteins. These analyses were carried out mainly using Dr Arthur Lesk's PINQ program.⁶ We end by discussing the implications of our results for general aspects of domain recombination.

Proteins of known structure that have classical Rossmann and catalytic domains

Rossmann domains are linked to a great variety of different catalytic domains. The domains for which the three-dimensional structures are known are listed and classified in the Structural Classification of Proteins (SCOP) database.⁷ In this database, proteins, or protein domains, are classified in a hierarchical set of categories. In ascending order these are: family, superfamily, fold and class. A family clusters together proteins for which sequence similarity alone implies an evolutionary relationship. A superfamily clusters together families for which the combination of structural and functional features implies a common evolutionary origin even though the degree of sequence similarity is low. A fold clusters together superfamilies and/or families that have the same major secondary structures in the same topological arrangement.⁷

In order to delineate the repertoire of domain partners for the Rossmann domain we used information collected from the SCOP database (version 1.53) and from the ASTRAL database, which contains a list of the sequences of SCOP entries.⁸ Here, we are concerned with the classical "NAD(P)-binding Rossmann-fold domains" and their partners. There are also more distantly related domains, described in SCOP as the "FAD/NAD(P)-binding domain" and the "nucleotide-binding domain", but these are not considered here. We also do not consider the few "short chain" dehydrogenases where the role of the second domain is played by medium-sized insertions within the Rossmann domain.

Inspection of the databases shows that, for the enzymes of known structure that contain classical Rossmann domains, the catalytic domains to which they are linked come from one of seven different superfamilies. One of these superfamilies is formed by a cluster of five families, three superfamilies are each formed by clusters of three families, and three superfamilies have only one family. Thus the catalytic domains linked to classical Rossmann domains come from 17 different families, which belong to one of seven different superfamilies. The seven superfamilies and the families that form them are listed in Table 1. In this Table the superfamilies are numbered from 1 to 7 and will be referred to by these numbers hereinafter.

For the work described here the atomic co-ordinates of the structure of a representative member of each of the 17 families were obtained from the Protein Data Bank (PDB).⁹ Amongst these structures, there are two that are exceptions to the picture of simple two-domain proteins with one domain binding NAD+ and the other catalytic. The Rossmann domain of succinyl-CoA synthase (2scu) binds coenzyme A and not NAD+. UDP-glucose dehydrogenase (1dlj) has a third, UDP-binding, domain joined to the catalytic domain.

Structure and function of classical Rossmann domains

Rossmann domains bind the coenzyme NAD+. This coenzyme takes part in redox reactions which use the ability of its nicotinamide ring to reversibly accept a hydrogen. As mentioned above, the Rossmann domains discussed here are linked to catalytic domains which play the major role in determining substrate specificity and the precise reaction of the enzyme.

Rossmann domains have an α/β fold. There is a central β-sheet, with usually six parallel strands arranged in the order 6, 5, 4, 1, 2, 3, and approximately five α-helices which, together with a number of loops, connect the strands (Figure 1(a)). In the simplest Rossmann domains, the structure begins at the first β-strand and ends at the sixth β-strand. The more complex structures have extensions at the N or C terminus or at both termini. The inter-sheet crossovers form a crevice at the C-terminal end of the β-sheet that provides a binding site for NAD+.¹⁰

If Rossmann domains taken from structures that have unrelated catalytic domains are superposed, it is apparent that there are only small local differences in the conformation and position of NAD (Figure 1(a)). A referee of this paper pointed out to us that after completion of this work a structure was published which contained a Rossmann domain whose NAD is bound in a different orientation to that discussed here. We discuss this structure briefly below.

In redox enzymes the substrate is bound mainly by the catalytic domain and a hydride ion is transferred to NAD+. Thus, the substrate and NAD must be brought into close proximity and in an orientation that enables the transfer of the hydrogen from the substrate to C4 of the nicotinamide ring. This places strong functional constraints on the relative positions of NAD+ and substrate. Figure 1(b) shows the relative positions of the coenzyme-substrate pair in one representative structure from each of the six superfamilies that bind NAD+. It can be seen that the NAD+ coenzyme and the different substrates are placed relative to each other in the same orientation in each of the superfamilies: taking the nicotinamide ring in Figure 1(b) to be in the plane of the page, the reactive end of each of the substrates is below the page, some 3 Å from the C4 group.
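The geometric constraint described above (a reactive group on one face of the nicotinamide ring and roughly 3 Å from C4) can be illustrated with a short sketch. This is not the procedure used in this work; the function names, the 4.0 Å cutoff and the convention that "below" means the negative side of the ring normal are assumptions made for illustration only, and the sign of the normal depends on the order in which the three ring atoms are supplied.

```python
import math

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def transfer_geometry(c4, ring_a, ring_b, reactive_atom):
    """Return (distance to C4, signed height above the ring plane) in the input units.

    The ring plane is defined by C4 and two other ring atoms; the normal
    direction follows the right-hand rule for (ring_a - c4) x (ring_b - c4).
    """
    normal = _cross(_sub(ring_a, c4), _sub(ring_b, c4))
    length = math.sqrt(_dot(normal, normal))
    if length == 0.0:
        raise ValueError("ring atoms are collinear; plane is undefined")
    unit_normal = tuple(x / length for x in normal)
    offset = _sub(reactive_atom, c4)
    distance = math.sqrt(_dot(offset, offset))
    height = _dot(offset, unit_normal)
    return distance, height

def is_poised_for_transfer(c4, ring_a, ring_b, reactive_atom, max_distance=4.0):
    """True if the reactive atom is 'below' the ring and within max_distance of C4.

    max_distance (4.0 here) is a hypothetical tolerance around the ~3 A
    separation quoted in the text, not a value taken from the paper.
    """
    distance, height = transfer_geometry(c4, ring_a, ring_b, reactive_atom)
    return distance <= max_distance and height < 0.0

# Toy coordinates (hypothetical): ring in the z = 0 plane, normal along +z.
c4, ring_a, ring_b = (0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)
below = is_poised_for_transfer(c4, ring_a, ring_b, (0.0, 0.0, -3.0))
above = is_poised_for_transfer(c4, ring_a, ring_b, (0.0, 0.0, 3.0))
```

With real structures the three ring atoms and the reactive atom would be read from the coordinate file; the toy coordinates here only show the sign convention.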

Table 1. Superfamilies that are linked to classical Rossmann domains

The column "linkage with respect to Rossmann fold domain" illustrates how families are sequentially related. R, Rossmann fold domain; C, catalytic domain in another superfamily; A, another domain.
a Excluding hits to the same protein and multiple orthologs.
b This family is complicated by C-terminal extensions to the Rossmann domain that pack against the catalytic domain.
c The number next to the PDB code in the last column indicates the reference for that structure (see below).

Types of connections between catalytic and Rossmann domains

We can define four different types of connections between catalytic and Rossmann domains. The Rossmann domain can be joined to the N terminus of the catalytic domain (type I) or to its C terminus (type II); the catalytic domain can be inserted within the Rossmann domain (type III), or the Rossmann domain can be inserted within the catalytic domain (type IV).

Within families, the type of domain connection is always the same, and in Figure 2 we show the relative positions of the domains in a representative of each of the 17 families listed in Table 1. The data in this Figure show that the different families that form a given superfamily also have the same type of connection. For example, the structures that represent the five families that form superfamily 2 all have the Rossmann domain linked to the N terminus of the catalytic domain (Figure 2).

The second conclusion that can be drawn from the Figure is that different superfamilies can have different types of connections and that all four types occur. Thus superfamilies 1, 2 and 3 have the Rossmann domain linked to the N terminus of the catalytic domain (type I); superfamily 4 has the Rossmann domain linked to the C terminus of the catalytic domain (type II); superfamily 5 has the catalytic domain inserted into the Rossmann domain (type III); and superfamilies 6 and 7 have the Rossmann domain inserted in the catalytic domain (type IV).

Figure 1. (a) Classical Rossmann fold with NADs (one representing each superfamily) superposed via their Rossmann domains. The Rossmann domain shown is from the structure of lactate dehydrogenase (1ldm). The position of the bound NAD is conserved across the different superfamilies, although there is some variance in the local conformation, particularly of the nicotinamide ring (see below). (b) NAD and substrates/inhibitors. Each structure represents one of the six superfamilies that use NAD. It can be seen from this Figure that the conformation of the nicotinamide ring can be rotated about 180° about the glycosidic bond. Rings with the amide group pointing downwards are said to be in the anti conformation (1ldm, 1dlj, 1dss) and those with the amide group pointing upwards are said to be in the syn conformation (1deh, 1dxy, 1bxg).

Conservation within superfamilies of the relative position of the domains and of their linker regions

The previous section showed that, within superfamilies, the sequential relationships of Rossmann and catalytic domains are the same. In this section we are concerned with the geometrical relation of the domains and connections. We discuss, for each superfamily, the similarities in relative positions of their domains, the topology of the links between them and the extent to which the structure of the links is conserved.

When discussing the links between the domains it is useful to have two positions of reference that are common to different Rossmann domains. The first of these is at, or near, the N terminus: the first conserved site in the first β-strand. The second reference position is the last conserved site in the sixth β-strand. This position is near the C terminus in some structures and somewhat prior to it in others. In lactate dehydrogenase (1ldm) these sites are at positions 22 and 161. We will refer to the regions between these reference positions and the first (or last) residue in the catalytic domain as the linker regions.

First we discuss the similarity of linker regions in the different families that belong to a single superfamily. Superfamily 2 is formed by a cluster of five families; superfamilies 4, 5 and 7 are each formed by clusters of three families, and superfamilies 1, 3 and 6 have only one family.

Taking superfamily 7 as an example, we first compare the structures of one representative of each of its three families. Ribbon drawings of the three representative structures are shown in Figure 3. The catalytic domains are shown in white, the core of the Rossmann domains in yellow and linker regions in black. As noted above, the connection in this superfamily is type IV: the Rossmann domains are inserted within the catalytic domains. This means that there are two linker regions: the first from the catalytic domain to the N terminus of the Rossmann domain and the second from its C terminus back to the catalytic domain (Figure 3).

Inspection of Figure 3 shows that in all three representative structures the relative positions of the two domains are essentially the same. Also, the interfaces between the two domains are very similar in the three structures: they are formed by two coiled regions of polypeptide and the surfaces buried in the interface are between 700 and 950 Å².

The linker regions that connect the domains (black ribbons) follow the same general path in the three structures. However, they differ in length and local conformation. The lengths of the first and second connections in the three structures are 23 and 14 residues (1b3r), 35 and 8 (1pjc) and 45 and 41 (1dxy). The first connection begins with an α-helix in each structure but after that the local conformations are quite different. The three second connections have little similarity in local conformation (Figure 3).

Figure 2. Domain organisation of representatives from each of the 17 families used in this study. Rossmann domains are shown in grey, marked R, neighbouring catalytic domains in white, marked CAT, and other domains in black, marked O. The small numbers indicate domain boundaries. Large numbers indicate superfamilies (see Table 1).

Figure 3. Structures for a representative of each of the three families in superfamily 7 showing the linkage between the two domains. The Rossmann domain is indicated in yellow, the neighbouring domain in white and the linkage in black. NAD is shown as a space-filling model. The structures used here are indicated by the PDB identifier shown adjacent to each rendition.

Examination of the other three superfamilies that have more than one family gives very similar results: (i) within a given superfamily the two domains have the same relative position (though in different superfamilies the actual positions are different; see below); (ii) the interfaces between domains are very similar in structure and area; (iii) the linker regions that connect the domains follow the same general path; and (iv) the connections vary somewhat in their length and local conformation. In fact, the differences in length and local conformation seen in the other superfamilies are smaller than those in superfamily 7. The linking regions in one structure from each of superfamilies 2, 4 and 5 are shown in Figure 4. In superfamily 2, the single linker region occurs after the sixth strand of the Rossmann domain and in the different families consists of a conserved β-α-β unit preceded by a short region that has a variable conformation. In superfamily 4, a single helix crosses over the Rossmann β-sheet from the catalytic domain and either joins directly to the first strand of the Rossmann domain or, in some families, joins through a second short helix. In superfamily 5, the two linker regions between the Rossmann and catalytic domains are both short, partly buried and near the active site. In the different families the first connection has only small variations in size and conformation and the second has little or none.
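The four connection types used throughout this section (types I to IV) depend only on the order of the Rossmann and catalytic segments along the chain, so they can be read off from domain boundaries of the kind shown in Figure 2. The sketch below is an illustration under stated assumptions, not the authors' procedure: the function name, the ('R'/'C') labels and the residue ranges in the example are all hypothetical.

```python
def connection_type(segments):
    """Classify a Rossmann/catalytic connection as type 'I', 'II', 'III' or 'IV'.

    segments: iterable of (start_residue, end_residue, label) tuples, where
    label is 'R' (Rossmann) or 'C' (catalytic). A domain split by an inserted
    partner appears as two segments with the same label.
    Returns None if the arrangement is not one of the four defined types.
    """
    # Read the labels in N-to-C order along the polypeptide chain.
    ordered = [label for _, _, label in sorted(segments)]

    # Merge adjacent segments that belong to the same domain.
    pattern = []
    for label in ordered:
        if not pattern or pattern[-1] != label:
            pattern.append(label)

    return {
        "RC": "I",     # Rossmann joined to the N terminus of the catalytic domain
        "CR": "II",    # Rossmann joined to the C terminus of the catalytic domain
        "RCR": "III",  # catalytic domain inserted within the Rossmann domain
        "CRC": "IV",   # Rossmann domain inserted within the catalytic domain
    }.get("".join(pattern))

# Hypothetical boundaries, chosen only to exercise each case.
type_one = connection_type([(1, 160, "R"), (161, 330, "C")])
type_four = connection_type([(1, 120, "C"), (121, 290, "R"), (291, 360, "C")])
```

A type IV arrangement, as in superfamily 7, necessarily has two linker regions, which is why three segments are supplied for that example.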

Superfamilies 1, 3 and 6 each have just one family. Within these families (and within the families of the other superfamilies), the variations in the size and conformation of the connections are much smaller than those found between families. A particularly extreme case of this is seen in superfamily 1. In this superfamily, the linker region is only two residues (see Figure 4, top left), and it is deeply buried in a large interface. The accessible surface area of the isolated domains that is buried in the interface between the two domains of lactate dehydrogenase (1ldm) is 3850 Å² (in superfamily 7, for example, it is between 700 Å² (1b3r) and 950 Å² (1pjc)). This position of the linker region suggests that it is subject to strong structural and functional constraints: the accommodation of any insertion, deletion or large mutation would involve movements at a close-packed interface that would be likely to change the relative positions of the domains and thus be detrimental to the active site.

This view is supported by an examination of the sequences of homologues of superfamily 1 structures. Homologues were retrieved via a FASTA¹¹ search of NRDB90¹² and a multiple alignment was generated with ClustalW.¹³ Inspection of these sequences shows that no insertions or deletions occur in the region of the domain connection, though they do take place in the immediately adjacent regions. It also showed that there is extensive sequence conservation in the region around the buried connection.

Figure 4. The relative positions of the Rossmann and catalytic domains and the linker regions in one representative of each of the seven superfamilies (Table 1). The Rossmann domain is indicated in yellow, the catalytic domain in white and the linking regions in black. Space-filling models of the coenzyme are shown. This is NAD+ in all structures except SF3, where it is CoA. SF, superfamily; the PDB identifier of each structure is shown adjacent to its rendition; roman numerals indicate the domain linkage type.

A classical Rossmann domain that binds NAD in a novel orientation

As mentioned above, a referee of this paper pointed out a recently published protein structure that contains a classical Rossmann domain that binds NAD in a novel orientation. The structure in question is the dI component of transhydrogenase from Rhodospirillum rubrum,¹⁴ a membrane-bound proton pump that links a proton gradient to the equilibrium between NAD(H) and NADP(H). Transhydrogenase has three subunits: dI, which binds NAD; dIII, which binds NADP(H); and dII, the transmembrane component. The structure of dIII¹⁵ is described in SCOP as a "DHS-like NAD/FAD-binding domain".

The structure of dI shows that it has a classical Rossmann domain and is a homologue of L-alanine dehydrogenase (AlaDH):¹⁴ 297 aligned residues in the two structures have 31% sequence identity and their main-chain atoms have an rms difference in position of 1.9 Å. However, the nicotinamide half of NAD in dI has an orientation quite different to that in AlaDH and in other classical Rossmann domains of known structure. Instead of pointing towards the second domain it points out into the solvent (Figure 5(a)). This makes the nicotinamide ring accessible for interaction with dIII.¹⁴

This alteration in the conformation of NAD is the result of at least two differences between AlaDH and dI. First, dI has a loop, not present in AlaDH, that stabilises the novel conformation of NAD.¹⁴ Second, some of the residues that would pack against the nicotinamide end of NAD, if it had the same conformation as that in AlaDH, have volumes that are different to those in AlaDH.

Although the first domain in dI does not, of course, have a catalytic function as it has in AlaDH, it very probably does play a functional role in being part of the binding site for dIII.¹⁴
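The two similarity measures quoted for dI and AlaDH (percentage sequence identity over aligned residues, and the rms difference in position of main-chain atoms) can be made concrete with a minimal sketch. It assumes that the alignment and the superposition have already been carried out by other software; the function names and the toy inputs are hypothetical, and no optimal superposition is attempted here.

```python
import math

def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percentage of identical residues over aligned (non-gap) positions.

    aligned_a and aligned_b are equal-length rows of a pairwise alignment.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b)
             if a != gap and b != gap]
    if not pairs:
        raise ValueError("no aligned residue pairs")
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)

def rms_difference(coords_a, coords_b):
    """Root-mean-square difference in position of equivalent atoms.

    coords_a and coords_b are equal-length lists of (x, y, z) tuples that
    are assumed to be ALREADY superposed in a common frame.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = sum((xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
                for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b))
    return math.sqrt(total / len(coords_a))

# Toy inputs (hypothetical), not data from the structures discussed.
identity = percent_identity("ACDE-G", "ACDFHG")
rms = rms_difference([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
                     [(0.0, 0.0, 2.0), (1.0, 0.0, 0.0)])
```

Dividing by the number of aligned pairs, rather than by the length of either sequence, matches the way the figure is quoted in the text ("297 aligned residues ... have 31% sequence identity").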

Figure 5. (a) The Rossmann domain of the dI component of transhydrogenase (1f8g)¹⁴ showing NAD from dI in green and NAD from the homologous protein L-alanine dehydrogenase (AlaDH) (1pjc)³¹ in red. The orientation of NAD in AlaDH is the same as that in all other classical Rossmann domains. Residues in green (221-240) indicate those in an inserted loop (relative to AlaDH) which make contacts to NAD and stabilise it in the novel conformation. Yellow indicates sites whose residues in AlaDH make contacts to NAD and which have side chains with different volumes in dI. (b) Transhydrogenase dI showing the Rossmann domain in yellow, linking regions in black and catalytic domain in white. NAD is shown in its novel orientation. Compare this to the drawing of AlaDH (1pjc)³¹ in Figure 3.

Domain recombination and protein function

The general geometrical requirements for the position of the substrate relative to NAD+ were briefly described above: the reactive group has to be below the plane of the nicotinamide ring and close to its C4 atom (Figure 1(b)). This, of course, places constraints on the relative positions of the Rossmann and catalytic domains. A priori, we might expect the range of possible orientations to be that illustrated by the schematic diagram in Figure 6. If the catalytic domains are roughly the same size as the Rossmann domains, we would expect their centres to occupy positions that, from the viewpoint of this Figure, form an arc of roughly 90° around the nicotinamide ring. The position of a particular catalytic domain would, of course, be determined by the position of its binding site and the orientation of the substrate in that site.

In Figure 4 we show, for one representative of each of the seven superfamilies, the relative positions of the Rossmann and catalytic domains. The positions of the catalytic domains in the different superfamilies cover the range of expected positions, though with more at the extremities of the range than in the middle. Only in the case of superfamily 1 does a short connection determine the relative position of the domains. In the other six superfamilies the relative position of the domains is constrained neither by the type of the connection nor by its length (Figure 4). For example, superfamilies 2, 4, 6 and 7 have their two domains in very similar relative positions and have connections of type I, II and IV, respectively.

In the case of AlaDH and dI, we see how two domains can conserve their overall geometry but radically change the nature of the functional relationship. In AlaDH the first domain is the catalytic partner of a second cofactor-binding domain; in dI the first domain helps bind another subunit with which the second interacts.

Before discussing the implications of this analysis of Rossmann and catalytic domains we will briefly describe the results of the analysis of the second set of proteins.

Figure 6. The range of positions for catalytic domains that allows them to bring the substrate close to the nicotinamide ring of NAD+. The catalytic domains (CAT) are represented by grey spheres, the Rossmann domain by its six β-sheet strands.

Proteins with pairs of the same domains linked together in different sequential order

In the previous part of the paper we discussed Rossmann domains linked to catalytic domains from seven different superfamilies. We found that, though different superfamilies have different types of connection with Rossmann domains, those made by a particular superfamily are all of the same type. Investigations by Apic et al.⁵ showed previously that this is generally true. For the proteins whose structure was known at the time of their work, there were 356 different pairs of superfamily-superfamily domain combinations. For 348 of the pairs, the sequential order of the two domains is always the same in the different proteins in which they are found. For eight pairs both type I and type II connections were observed. Thus, if proteins contain domains from superfamilies A and B, in 348 cases the domains are found in either the sequential order AB or BA, but in eight cases they are found in both orders. Prior to this work Todd et al.⁴⁹ described two of these cases.

These eight cases are listed in Table 2. The two cases described by Todd et al.⁴⁹ are the fifth and the eighth in this Table. The first seven of the cases involve: two where AB and BA combinations are observed; two where we find AB and ABA combinations; and three where just the ABA combination is found. The eighth case is different to the rest and is discussed after these seven.

We examined the structures of these proteins, and the papers that describe these structures, to answer questions on the geometry and function of the domains. In the cases where the sequential order is reversed we wanted to know (i) whether or not the domains are in the same relative orientation (if the connections are long enough the domains can, of course, be put in the same relative positions, as we have seen above) and (ii) whether they have related or different functions. In cases where the ABA combinations are found we wanted to know the function of the two A domains.

For the seven cases, the same general answer can be given to these questions. In the two AB and BA cases (see Table 2), the relative positions of the domains in the AB combination are different to those in the BA combination. This means, as might be expected, that the functions of the two combinations are different: see Figure 7 and Table 2 for short descriptions of these different functions. (For full descriptions see the references listed in Table 2.) In the ABA combinations the two A domains must, of course, be in different positions. In the five structures where they occur they also have different functions: these are briefly described in Table 2, one case is illustrated in Figure 7(b), and full details are given in the references listed in Table 2.

Figure 7. Structures that contain the same pair of domains linked in both the AB and BA combinations. (a) In red, eukaryotic transcription initiation factor 5A³⁴ (eIF-5A) (2eif). In blue, ribosomal protein L2³³ (1rl2). Both structures have an OB-fold domain and an SH3 domain. In eIF-5A the SH3 domain is N-terminal and in L2 it is C-terminal. In eIF-5A, Lys40 (which is post-translationally modified to hypusine) is shown in green and adjacent conserved residues in yellow. eIF-5A is implicated in RNA binding and the hypusine modification is known to be required for function.³⁴,⁴⁸ In L2, the sites involved in RNA binding are highlighted in yellow. (b) Sulfite reductase⁴⁵ (1aop). The structure has an A1B1A2B2 domain order; the domains are coloured red, green, yellow and blue, respectively. The two B domains are homologous but bind different co-factors (see the text).

Table 2. Proteins that contain domain pairs that are linked together in different sequential order

A and B are the domain pair in question; X indicates other domains not in the pair but part of the same polypeptide chain. The number next to the protein name indicates the reference for that structure.

The eighth case is quite different to the other seven. Prokaryotic glutathione synthetase has two distinct domains in the order AB. In the eukaryotic enzyme a cyclic permutation of the gene has transferred the last one third of the B domain to the N terminus of the A domain to produce a B′AB″ arrangement, where B′ is the C-terminal part of the prokaryotic B domain and B″ is the other part. This cyclic permutation was also described by Todd et al.⁴⁹ In the prokaryotic structure the N and C termini are close together and this change in the domain links can take place without producing a change in their relative positions.
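The cyclic permutation described for glutathione synthetase can be sketched at the level of domain labels. The example below assumes a purely schematic chain in which every residue carries the label of its domain; the function names and the domain lengths (6 and 9 residues) are hypothetical and chosen only so that one third of B is moved, as in the text.

```python
from itertools import groupby

def cyclic_permutation(labels, n_moved):
    """Move the last n_moved residues of the chain to its N terminus.

    labels is a string with one domain label per residue, N to C.
    """
    if not 0 <= n_moved <= len(labels):
        raise ValueError("n_moved must lie between 0 and the chain length")
    return labels[len(labels) - n_moved:] + labels[:len(labels) - n_moved]

def domain_segments(labels):
    """Collapse a per-residue label string into (label, length) segments."""
    return [(label, len(list(run))) for label, run in groupby(labels)]

# Schematic prokaryotic-like chain: domain A followed by domain B.
prokaryotic = "A" * 6 + "B" * 9

# Transfer the last third of B (3 of 9 residues) to the N terminus.
eukaryotic = cyclic_permutation(prokaryotic, 3)
layout = domain_segments(eukaryotic)
```

In this toy layout the leading B segment corresponds to B′ and the trailing one to B″. The permutation changes only where the chain begins and ends, which is why it can leave the relative positions of the domains unchanged when the original termini lie close together.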

Discussion

Here we have analysed domain connections, and their relation to function, in two sets of proteins. The context in which to examine the implications of the results of this analysis is the work on domain combinations carried out by Apic et al.⁵ They found that, in the protein families that make combinations, most do so with just one or two other families; that a few families make many combinations; and that particular superfamily-superfamily combinations almost always occur in the same sequential order.

The domain combinations formed by the superfamily of classical Rossmann domains with seven different superfamilies of catalytic domains have all these general features. The seven superfamilies of catalytic domains combine with only one other superfamily (the members of the Rossmann superfamily); conversely, the Rossmann superfamily itself is an example of one of the few that combine with many other superfamilies. Different sequential orders can be found for Rossmann domains combined with catalytic domains from different superfamilies but, in combination with catalytic domains from one superfamily, the order is always the same. Analysis of genome sequences extends these observations: Rossmann domains are found in combinations not in the current structure database, and for particular combinations the order is conserved.⁵

The analysis of the Rossmann domains and their partners shows that there is no general relationship between the sequential order of domains and the positioning of domains required for function. When the linking regions are long, as in most of the structures considered here, they allow similar relative positions to be achieved with all four types of connections. Only in superfamily 1, where the link between domains is buried and very short, is there a direct, simple relationship between the type of domain connection and function.

In all cases where Rossmann domains are linked to the members of one catalytic superfamily the domains are in the same relative position, their interfaces have the same structure and their linker regions follow the same path. The local changes in conformation that occur in the linker regions within superfamilies 2, 4, 6 and 7 are typical of those found in peripheral surface loops of distantly related homologous proteins. Conversely, the strong conservation of the linker regions in superfamily 1 is typical of regions deeply buried in the structure.

These results indicate that the reason why members of a superfamily of catalytic domains conserve the sequential order of their connections to Rossmann domains is that each pairing arose from a single recombination event and the superfamily arose from subsequent duplications and divergence of this pair. This is likely to be true of many other superfamily-superfamily domain combinations.

In the cases where both sequential orders for a pair of domains are found, i.e. AB and BA or ABA, the relative positions of the domains in AB and BA combinations are quite different (with one exception). This difference in position, together with natural selection, gives the two domains different functional relationships. The only known exception is bacterial and eukaryotic glutathione synthetases, where the particular geometry of the domain association allows a cyclic permutation of the gene without a change in the relative position or function of the domains.

Acknowledgements

We thank Goga Apic, Sarah Teichmann, Alexey Murzin and Arthur Lesk for discussions and information and our colleagues for advice.

References

1. Rossmann, M. G., Moras, D. & Olsen, K. W. (1974). Chemical and biological evolution of nucleotide-binding protein. Nature, 250, 194-199.
2. Teichmann, S. A., Park, J. & Chothia, C. (1998). Structural assignments to the Mycoplasma genitalium proteins show extensive gene duplications and domain rearrangements. Proc. Natl Acad. Sci. USA, 95, 14658-14663.
3. Teichmann, S. A., Chothia, C. & Gerstein, M. (1999). Advances in structural genomics. Curr. Opin. Struct. Biol. 9, 390-399.
4. Park, J., Lappe, M. & Teichmann, S. A. (2001). Mapping protein family interactions: intramolecular and intermolecular protein family interaction repertoires in the PDB and yeast. J. Mol. Biol. 307, 929-938.

5. Apic, G., Gough, J. & Teichmann, S. A. (2001). Domain combinations in archaeal, eubacterial and eukaryotic proteomes. J. Mol. Biol. 310, 311-325.
6. Lesk, A. M. (1986). Integrated access to sequence and structural data. In Biosequences: Perspectives and user services in Europe (Saccone, C., ed.), pp. 23-28, EEC, Bruxelles.
7. Murzin, A. G., Brenner, S. E., Hubbard, T. & Chothia, C. (1995). SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247, 536-540.
8. Brenner, S. E., Koehl, P. & Levitt, M. (2000). The ASTRAL compendium for sequence and structure analysis. Nucl. Acid Res. 28, 254-256.
9. Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N., Weissig, H. et al. (2000). The Protein Data Bank. Nucl. Acid Res. 28, 235-242.
10. Branden, C. I. (1980). Relation between structure and function of alpha/beta-proteins. Quart. Rev. Biophys. 13, 317-338.
11. Pearson, W. R. & Lipman, D. J. (1988). Improved tools for biological sequence comparison. Proc. Natl Acad. Sci. USA, 85, 2444-2448.
12. Holm, L. & Sander, C. (1998). Removing near-neighbour redundancy from large protein sequence collections. Bioinformatics, 14, 423-429.
13. Thompson, J. D., Higgins, D. G. & Gibson, T. J. (1994). CLUSTALW: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucl. Acid Res. 22, 4673-4680.
14. Buckley, P. A., Baz, Jackson J., Schneider, T., White, S. A., Rice, D. W. & Baker, P. J. (2000). Protein-protein recognition, hydride transfer and proton pumping in the transhydrogenase complex. Struct. Fold. Des. 8, 809-815.
15. Jeeves, M., Smith, K. J., Quirk, P. G., Cotton, N. P. & Jackson, J. B. (2000). Solution structure of the NADP(H)-binding component (dIII) of proton-translocating transhydrogenase from Rhodospirillum rubrum. Biochim. Biophys. Acta, 1459, 248-257.
16. Abad-Zapatero, C., Griffith, J. P., Sussman, J. L. & Rossmann, M. G. (1987). Refined crystal structure of dogfish M4 apo-lactate dehydrogenase. J. Mol. Biol. 198, 445-467.
17. Adams, M. J., Gover, S., Leaback, R., Phillips, C. & Somers, D. O. (1991). The structure of 6-phosphogluconate dehydrogenase refined at 2.5 Å resolution. Acta Crystallog. sect. B, 47, 817-820.
18. Thomazeau, K., Dumas, R., Halgand, F., Forest, E., Douce, R. & Biou, V. (2000). Structure of spinach acetohydroxyacid isomeroreductase complexed with its product of reaction dihydroxy-methylvalerate, manganese and ADP-ribose. Acta Crystallog. sect. D, 56, 389-397.
19. Barycki, J. J., O'Brien, L. K., Bratt, J. M., Zhang, R., Sanishvili, R., Strauss, A. W. et al. (1999). Biochemical characterization and crystal structure determination of human heart short chain L-3-hydroxyacyl-CoA dehydrogenase provide insights into catalytic mechanism. Biochemistry, 38, 5786-5798.
20. Campbell, R. E., Mosimann, S. C., Van De Rijn, I., Tanner, M. E. & Strynadka, N. C. J. (2000). The first structure of UDP-glucose dehydrogenase reveals the catalytic residues necessary for the two-fold oxidation. Biochemistry, 39, 7012-7023.

21. Britton, K. L., Asano, Y. & Rice, D. W. (1998). Crystal structure and active site location of N-(1-D-carboxylethyl)-L-norvaline dehydrogenase. Nature Struct. Biol. 5, 593-601.
22. Fraser, M. E., James, M. N. G., Bridger, W. A. & Wolodko, W. T. (1999). A detailed description of the structure of succinyl-CoA synthetase from Escherichia coli. J. Mol. Biol. 285, 1633-1653.
23. Vanhooke, J. L., Thoden, J. B., Brunhuber, N. M., Blanchard, J. S. & Holden, H. M. (1999). Phenylalanine dehydrogenase from Rhodococcus sp. M4: high-resolution X-ray analyses of inhibitory ternary complexes reveal key features in the oxidative deamination mechanism. Biochemistry, 38, 2326-2339.
24. Shen, B. W., Dyer, D. H., Huang, J-Y., D'Ari, L., Rabinowitz, J. & Stoddard, B. L. (1999). The crystal structure of a bacterial, bifunctional 5,10-methylenetetrahydrofolate dehydrogenase/cyclohydrolase. Protein Sci. 8, 1342-1349.
25. Yang, Z., Floyd, D. L., Loeber, G. & Tong, L. (2000). Structure of a closed form of human malic enzyme and implications for catalytic mechanism. Nature Struct. Biol. 7, 251-257.
26. Vellieux, F. M., Hajdu, J., Verlinde, C. L., Groendijk, H., Read, R. J., Greenhough, T. J. et al. (1993). Structure of glycosomal glyceraldehyde-3-phosphate dehydrogenase from Trypanosoma brucei determined from Laue data. Proc. Natl Acad. Sci. USA, 90, 2355-2359.
27. Delabarre, B., Thompson, P. R., Wright, G. D. & Berghuis, A. M. (2000). Crystal structures of homoserine dehydrogenase suggest a novel catalytic mechanism for oxidoreductases. Nature Struct. Biol. 7, 238-244.
28. Rowland, P., Basak, A. K., Gover, S., Levy, H. R. & Adams, M. J. (1994). The three-dimensional structure of glucose 6-phosphate dehydrogenase from Leuconostoc mesenteroides refined at 2.0 Å resolution. Structure, 2, 1073-1087.
29. Davis, G. J., Bosron, W. F., Stone, C. L., Owusu-Dekyi, K. & Hurley, T. D. (1996). X-ray structure of human β3β3 alcohol dehydrogenase - the contribution of ionic interactions to coenzyme binding. J. Biol. Chem. 271, 17057-17061.
30. Dengler, U., Niefind, K., Kiess, M. & Schomburg, D. (1997). Crystal structure of a ternary complex of D-2-hydroxyisocaproate dehydrogenase from Lactobacillus casei, NAD+ and 2-oxoisocaproate at 1.9 Å resolution. J. Mol. Biol. 267, 640-660.
31. Baker, P. J., Sawa, Y., Shibata, H., Sedelnikova, S. & Rice, D. W. (1998). Analysis of the structure and substrate binding of P. lapideum alanine dehydrogenase. Nature Struct. Biol. 5, 561-567.
32. Hu, Y., Komoto, J., Huang, Y., Gomi, T., Ogawa, H., Takata, Y. et al. (1999). Crystal structure of S-adenosylhomocysteine hydrolase from rat liver. Biochemistry, 38, 8323-8333.
33. Nakagawa, A., Nakashima, T., Taniguchi, M., Hosaka, H., Kimura, M. & Tanaka, I. (1999). The three-dimensional structure of the RNA-binding domain of ribosomal protein L2; a protein at the peptidyl transferase center of the ribosome. EMBO J. 18, 1459-1467.
34. Kim, K. K., Hung, L. W., Yokota, H., Kim, R. & Kim, S. H. (1998). Crystal structures of eukaryotic translation initiation factor 5A from Methanococcus jannaschii at 1.8 Å resolution. Proc. Natl Acad. Sci. USA, 95, 10419-10424.


35. Tews, I., Perrakis, A., Oppenheim, A., Dauter, Z., Wilson, K. S. & Vorgias, C. E. (1996). Bacterial chitobiase structure provides insight into catalytic mechanism and the basis of Tay-Sachs disease. Nature Struct. Biol. 3, 638-648.
36. Kim, J. S., Cha, S. S., Kim, H. J., Kim, T. J., Ha, N. C., Oh, S. T. et al. (1999). Crystal structure of a maltogenic amylase provides insights into a catalytic versatility. J. Biol. Chem. 274, 26279-26286.
37. Jain, S., Drendel, W. B., Chen, Z. W., Mathews, F. S., Sly, W. S. & Grubb, J. H. (1996). Structure of human beta-glucuronidase reveals candidate lysosomal targeting and active-site motifs. Nature Struct. Biol. 3, 375-381.
38. Jacobson, R. H., Zhang, X. J., DuBose, R. F. & Matthews, B. W. (1994). Three-dimensional structure of beta-galactosidase from E. coli. Nature, 369, 761-766.
39. Juers, D. H., Huber, R. E. & Matthews, B. W. (1999). Structural comparisons of TIM barrel proteins suggests functional and evolutionary relationships between β-galactosidase and other glycohydrolases. Protein Sci. 8, 122-136.
40. Xu, W., Doshi, A., Lei, M., Eck, M. J. & Harrison, S. C. (1999). Crystal structures of C-Src reveal new features of its autoinhibitory mechanism. Mol. Cell, 3, 629-638.
41. Maignan, S., Guilloteau, J. P., Fromage, N., Arnoux, B., Becquart, J. & Ducruix, A. (1995). Crystal structure of the mammalian Grb2 adaptor. Science, 268, 291-293.
42. Muller, A. M., Lindqvist, Y., Furey, W., Schulz, G. E., Jordan, F. & Gunter, S. (1993). A thiamin diphosphate binding fold revealed by comparison of the crystal structures of transketolase, pyruvate oxidase and pyruvate decarboxylase. Structure, 1, 95-103.
43. Dyda, F., Furey, W., Swaminathan, S., Sax, M., Farrenkopf, B. & Jordan, F. (1993). Catalytic centers in the thiamin diphosphate dependent enzyme pyruvate decarboxylase at 2.4-Å resolution. Biochemistry, 32, 6165-6170.
44. Garrett, T. P. J., Mckern, N. M., Lou, M., Frenkel, M. J., Bentley, J. D., Lovrecz, G. O. et al. (1998). Crystal structure of the first three domains of the type-1 insulin-like growth factor receptor. Nature, 394, 395-399.
45. Crane, B. R., Siegel, L. M. & Getzoff, E. D. (1995). Sulfite reductase structure at 1.6 Å: evolution and catalysis for reduction of inorganic anions. Science, 270, 59-67.
46. Matsuda, K., Mizuguchi, K., Nishioka, T., Kato, H., Go, N. & Oda, J. (1996). Crystal structure of glutathione synthetase at optimal pH: domain architecture and structural similarity with other proteins. Protein Eng. 9, 1083-1092.
47. Polekhina, G., Board, P. G., Gali, R. R., Rossjohn, J. & Parker, M. W. (1999). Molecular basis of glutathione synthetase deficiency and a rare gene permutation event. EMBO J. 18, 3204-3213.
48. Xu, A. & Chen, K. Y. (2001). Hypusine is required for a sequence-specific interaction of eukaryotic initiation factor 5A with postsystematic evolution of ligands by exponential enrichment RNA. J. Biol. Chem. 276, 2555-2561.
49. Todd, A. E., Orengo, C. A. & Thornton, J. M. (2001). Evolution of function in protein superfamilies, from a structural perspective. J. Mol. Biol. 307, 1113-1143.

Edited by J. Thornton (Received 13 August 2001; received in revised form 19 November 2001; accepted 26 November 2001)