Biosensors & Bioelectronics 9 (1994) 753-760
Structural aspects of biomolecular recognition and self-assembly
Richard N. Perham
Cambridge Centre for Molecular Recognition, Department of Biochemistry, University of Cambridge, Tennis Court Road, Cambridge CB2 1QW, UK
Abstract: Proteins offer two features that are important for any likely system of bioelectronics: the ability to recognise other molecules with exquisite specificity, and the ability to self-assemble, in vivo and in vitro, to generate an astonishing variety of three-dimensional structures. Much current work is aimed at the redesign of existing proteins, either as an end in itself or as a means of developing the knowledge base necessary for the ab initio design of novel proteins. This type of study has been greatly facilitated by the discovery of the modular or domain structure of many proteins, leading to concepts of protein manipulation as a kind of molecular Lego.
Keywords: protein engineering, protein folding, domains, sequence motifs, symmetry
0956-5663/94 $07.00 © 1994 Elsevier Science Ltd.
INTRODUCTION
Many biological structures are capable of self-assembly, i.e. of spontaneous formation from one or more different types of protein subunit that aggregate to yield the complex tertiary and quaternary structures that characterize living matter. In some instances, it is possible to reconstruct the process of self-assembly using the appropriate isolated proteins in vitro; in others this has proved impossible to achieve (or has been achieved only poorly), and assembly has to be studied in vivo. In both types of system, however, the fact that large organized structures are often made by packing identical protein subunits with identical bonding patterns means that the final structure will necessarily be a symmetrical one. Since proteins are made of chiral amino acids, biological structures will always be different from their mirror images.
This leaves rotations and translations as the only permitted spatial symmetry operations, which in turn implies that all such structures must be represented by the enantiomorphic line, point, plane and space groups. This was developed as a concept most notably by Caspar and Klug (1962) and, with further additions and refinements, has subsequently served as the basis for codifying and rationalizing a myriad of biological structures. Another general feature of protein structure to emerge more recently is the existence of domains as semi-autonomous folding units of three-dimensional structure. Many proteins appear to be composed of a mosaic of different domains (generally no larger than about 100 or so amino acid residues in size) assembled in different permutations and combinations to create the vast array of naturally occurring proteins (Bork, 1991; Branden & Tooze, 1991; Creighton, 1993). This constitutes an economical way of
generating complexity and has been invoked as a mechanism for the evolution of proteins from a set of simple starting points (see, for example, Dorit et al., 1990). In what follows, I propose to draw on these two guiding principles in protein structure and illustrate their utility in studies of protein design and redesign. Our ability to manipulate protein structure will almost certainly form an essential part of any use to which proteins may be put in the field of molecular electronics.
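The restriction to rotations and translations noted in the Introduction can be illustrated with a short sketch. It is not taken from the original article: it simply uses the standard fact that an operation preserves handedness only if the determinant of its matrix is positive, so mirrors and inversion, which would turn L-amino acids into their D-enantiomers, are excluded. The function names and the three example matrices are illustrative choices.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def preserves_handedness(m):
    """True for proper operations (rotations), False for mirrors and inversion."""
    return det3(m) > 0


# Illustrative operations: a two-fold rotation about z, a mirror in the
# xy plane, and inversion through the origin.
ROTATION_2FOLD_Z = [[-1, 0, 0], [0, -1, 0], [0, 0, 1]]
MIRROR_XY = [[1, 0, 0], [0, 1, 0], [0, 0, -1]]
INVERSION = [[-1, 0, 0], [0, -1, 0], [0, 0, -1]]

for name, op in [
    ("2-fold rotation", ROTATION_2FOLD_Z),
    ("mirror", MIRROR_XY),
    ("inversion", INVERSION),
]:
    verdict = "allowed" if preserves_handedness(op) else "excluded for chiral subunits"
    print(f"{name}: {verdict}")
```

Only the rotation passes the test, which is why assemblies of chiral subunits are confined to the enantiomorphic symmetry groups.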
LINE SYMMETRY AND HELICAL STRUCTURES
Line symmetry, the head-to-tail assembly of monomers, is generally exemplified in biology by the widespread occurrence of helical structures. This was realised first in the analysis of the structure of simple helical viruses, notably tobacco mosaic virus (for reviews, see Holmes, 1982 and Butler, 1984). Tobacco mosaic virus also exhibits another aspect of line symmetry that has only recently come to be fully appreciated in other connexions: a helical arrangement of identical subunits is infinitely extendable if nothing acts to terminate it. Thus, whereas tobacco mosaic virus is a rod 300 nm long, the coat protein subunits of the virus can aggregate in the absence of the RNA component to form rods with identical helical parameters but widely differing lengths. It is the interaction of the protein subunits with the RNA molecule that ensures that the growth of the protein helix stops when the RNA has been totally encapsidated, thereby generating virus particles that are all the same length. The RNA can thus be envisaged as acting as a molecular tape measure. This principle has been recognized and extended in several different instances. Filamentous bacteriophage fd is another helical rod (Marvin et al., 1994), albeit more flexible than tobacco mosaic virus, in which a single-stranded DNA molecule is encapsidated in a hollow tube, about 1 µm long, composed of 2700 coat protein subunits arranged with helical symmetry (for a review, see Model and Russel, 1988). Its length is also dictated by the length of the DNA, for if extra DNA is inserted into the genome the length of the viral filament grows pro rata without any other effect on the virion (Fig. 1). This property is the basis of the widespread use of the virus as
Fig. 1. Schematic models (not to scale) of the structure of three forms of filamentous bacteriophage fd: (a) the wild-type virion; (b) the virion with extra DNA cloned into a non-coding region (the intergenic space) of the DNA (wild type + cloned DNA insert); (c) the virion with wild-type DNA apart from a mutation (K48A) in gene VIII that replaces the lysine residue at position 48 in the major coat protein with alanine (K48A mutant). In (a), the single-stranded (non-base-paired) DNA molecule is encased in a sheath of 2700 copies of the major coat protein subunit, itself a single α-helix, with a few copies of each of two minor coat proteins located at one end of the virion and of two different minor coat proteins at the other end. In all cases, the assembly of the coat proteins to form the viral capsid is essentially the same. In (a) and (c), the contour lengths of the single-stranded DNA are identical but the DNA molecules occupy different lengths; in (b) the contour length of the DNA is larger, owing to the extra DNA that has been inserted. In (b) and (c), more copies of the major coat protein are required to form the correspondingly longer sheaths.
a cloning and DNA sequencing vector. In the encapsidation process, the negative charges of the sugar-phosphate backbone of the DNA are neutralized by the positive charges of the C-terminal domains of the coat protein subunits that line the inside of the growing filament. This positive charge density can be lowered by replacing the positively charged side chains of certain lysine residues with those of neutral amino acids, and the viral filament formed then becomes commensurately longer (Hunter et al., 1987). It would appear that lowering the positive charge density per unit length inside the helical tube forces the DNA to increase the length it occupies by adopting a more elongated conformation, thereby lowering its negative charge density per unit length in a matching process (Fig. 1). The length of the virion is still dictated by the length of the DNA molecule, but we have been able to manipulate the scale by which it is read (Greenwood et al., 1991). It should not be thought that only DNA or RNA molecules can act in this capacity: it has been shown that a large, mostly α-helical protein, the product of the viral H gene, serves to delimit the length of the helical array of other proteins that constitute the tail of bacteriophage λ (Katsura, 1987); and giant filamentous proteins, titin and nebulin, appear to act as protein rulers that define the assembly of the thick and thin filaments, respectively, of muscle from their component protein subunits (Labeit et al., 1991). This mode of limiting the assembly of structures with line symmetry is clearly widespread, and we shall doubtless come across further examples of it in the future.
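The two ways of changing the length of the fd virion described above (adding DNA, or lowering the positive charge that each coat protein subunit presents to the DNA) can be put into a rough numerical sketch. This is an idealized charge-matching model, not the analysis used in the cited papers. The wild-type length (about 1 µm) and subunit count (2700) come from the text; the genome size of 6408 nucleotides, the figure of four positive charges per subunit, and the assumption of strict proportionality are illustrative assumptions.

```python
WILD_TYPE_LENGTH_NM = 1000.0   # about 1 µm, from the text
WILD_TYPE_SUBUNITS = 2700      # major coat protein copies, from the text
WILD_TYPE_NUCLEOTIDES = 6408   # ASSUMED genome size; not given in the text
WILD_TYPE_CHARGES = 4          # ASSUMED positive charges per subunit facing the DNA


def virion_length_nm(nucleotides=WILD_TYPE_NUCLEOTIDES,
                     charges_per_subunit=WILD_TYPE_CHARGES):
    """Idealized filament length: proportional to the DNA length and
    inversely proportional to the positive charge per coat protein subunit."""
    if nucleotides <= 0 or charges_per_subunit <= 0:
        raise ValueError("nucleotides and charges_per_subunit must be positive")
    dna_factor = nucleotides / WILD_TYPE_NUCLEOTIDES
    charge_factor = WILD_TYPE_CHARGES / charges_per_subunit
    return WILD_TYPE_LENGTH_NM * dna_factor * charge_factor


def subunits_required(length_nm):
    """Coat protein copies needed if the helical packing is unchanged."""
    return round(WILD_TYPE_SUBUNITS * length_nm / WILD_TYPE_LENGTH_NM)


# Case (b) of Fig. 1: a hypothetical 1000-nucleotide insert.
with_insert = virion_length_nm(nucleotides=WILD_TYPE_NUCLEOTIDES + 1000)
# Case (c) of Fig. 1: one positive charge removed per subunit.
one_charge_removed = virion_length_nm(charges_per_subunit=WILD_TYPE_CHARGES - 1)

for label, length in [("insert", with_insert), ("charge removed", one_charge_removed)]:
    print(f"{label}: {length:.0f} nm, {subunits_required(length)} subunits")
```

In both cases the model reproduces the qualitative behaviour described in the text: the filament lengthens and more copies of the coat protein are needed, while the helical packing itself is unchanged.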
POINT GROUP SYMMETRY
For a particle of finite extent, the only symmetry possible is that of the point groups: the number of protein subunits in any such structure is fixed. There are three types of point group.
Cyclic symmetry
As in line symmetry, the subunits are arranged head to tail, but here they are constrained into a circular array. There is no restriction on the number of subunits, though it is generally small (6 or fewer) and can be even or odd. Typical biological examples are the enzymes glutamine synthetase (a 6-fold axis) and the catalytic subunit of aspartate carbamoyl transferase (a 3-fold axis) (Ke et al., 1988).
Dihedral symmetry
A 2-fold axis can be combined at right angles with any n-fold axis. The number of subunits, 2n, must be even. Most oligomeric enzymes are dimers or tetramers of dihedral symmetry. This is readily explained by recognising that symmetrical dimers are easily generated as a result of hydrophobic interactions between the participating monomers, and the dimers can then similarly associate to form tetramers.
Cubic symmetry
There are three types of cubic symmetry: tetrahedral (23), which demands 12 subunits; octahedral (432), which demands 24 subunits; and icosahedral (532), which demands 60 subunits. Octahedral symmetry is exemplified by the dihydrolipoyl acyltransferase (E2) component of the pyruvate dehydrogenase multienzyme complex of Gram-negative bacteria, such as Escherichia coli, which is formed of 24 E2 polypeptide chains. Icosahedral symmetry is exemplified by the E2 component of the pyruvate dehydrogenase complex of Gram-positive bacteria, such as Bacillus stearothermophilus, and eukaryotic mitochondria, which comprises 60 E2 polypeptide chains. These have been extensively reviewed elsewhere (Reed & Hackert, 1990; Perham, 1991), and a high-resolution structure of the octahedral E2 component of the pyruvate dehydrogenase complex from Azotobacter vinelandii has recently been determined (Mattevi et al., 1993). The amino acid sequences of E2 polypeptide chains that fall into the two different symmetry classes exhibit considerable similarity (Perham, 1991; Russell & Guest, 1991), and it is evident that two different pathways of assembly have evolved from a common precursor. In fact, relatively little is required, at least in principle, to convert an octahedral assembly into an icosahedral one, and a major objective now must be to learn the rules for the rational redesign of such a process. However, it is in the structures of 'spherical' viruses that icosahedral symmetry is most commonly observed (Caspar & Klug, 1962). In the simplest instances, for example bacteriophage φX174 (McKenna et al., 1992), the virus capsid is constructed of 60 copies of the coat protein subunit.
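The subunit counts quoted above for the cyclic, dihedral and cubic point groups can be checked by counting rotations: each n-fold axis contributes n - 1 rotations in addition to the identity, and the total is the number of equivalent positions. The following is a minimal sketch under that counting rule; the function name and the table of axes are illustrative, with the numbers of axes taken from standard descriptions of the 23, 432 and 532 point groups.

```python
# Rotation axes of the three cubic point groups: {fold: number of axes}.
CUBIC_AXES = {
    "tetrahedral": {3: 4, 2: 3},          # point group 23
    "octahedral": {4: 3, 3: 4, 2: 6},     # point group 432
    "icosahedral": {5: 6, 3: 10, 2: 15},  # point group 532
}


def subunit_count(point_group, n=None):
    """Number of identical subunits demanded by a rotational point group."""
    if point_group in CUBIC_AXES:
        axes = CUBIC_AXES[point_group]
        # Identity plus (fold - 1) rotations for every axis.
        return 1 + sum(count * (fold - 1) for fold, count in axes.items())
    if n is None or n < 1:
        raise ValueError("cyclic and dihedral groups need a positive fold n")
    if point_group == "cyclic":
        return n
    if point_group == "dihedral":
        return 2 * n
    raise ValueError(f"unknown point group: {point_group}")


print(subunit_count("cyclic", 3))      # e.g. a trimer with a 3-fold axis
print(subunit_count("dihedral", 2))    # e.g. a tetramer of dihedral symmetry
for group in CUBIC_AXES:
    print(group, subunit_count(group))
```

The counting rule makes it plain why the cubic groups, unlike the cyclic and dihedral ones, leave no freedom in the number of subunits.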
A bigger virus can be constructed if a bigger coat protein subunit is used, as with canine parvovirus (Tsao et al., 1991), but the need to encapsidate larger genomes is more often met by breaking out of the restrictions of strict icosahedral symmetry. If we allow systematic deformations of the bonds between the protein subunits in a number of slightly different environments, the subunits are no longer strictly equivalent. This concept of quasi-equivalence, in which the number of permissible protein subunits in the capsid is 60T, where T has a limited range of values (1, 3, 4, 7, etc.), has proved to be enormously useful in characterizing and understanding the structures of numerous spherical
viruses (Caspar & Klug, 1962). An apparent exception to the theory was the polyomaviruses, the capsid of which comprises 360 copies of a major protein arranged as 72 pentamers on the viral surface, but this can satisfactorily be resolved by the observation from a high-resolution structural analysis that 12 pentamers lie on the 12 five-fold rotation axes of the icosahedron, and the remaining 60 pentamers are 6-coordinated, with three kinds of inter-pentamer clustering (Liddington et al., 1991).
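The limited range of values open to T, and the way in which the polyomavirus capsid departs from the simple prediction, can be made concrete with a short sketch. It uses the standard Caspar-Klug relation T = h² + hk + k² (h and k non-negative integers, not both zero), which the text above does not spell out, together with the usual count of 12 pentamers and 10(T - 1) hexamers. The function names are illustrative.

```python
def triangulation_numbers(limit):
    """Return all T = h*h + h*k + k*k with 1 <= T <= limit, in ascending order."""
    values = set()
    h = 0
    while h * h <= limit:
        k = 0
        while h * h + h * k + k * k <= limit:
            t = h * h + h * k + k * k
            if t > 0:
                values.add(t)
            k += 1
        h += 1
    return sorted(values)


def quasi_equivalent_capsid(t):
    """Capsomere and subunit counts predicted for triangulation number t."""
    pentamers = 12                # one per five-fold vertex of the icosahedron
    hexamers = 10 * (t - 1)
    return {
        "pentamers": pentamers,
        "hexamers": hexamers,
        "capsomeres": pentamers + hexamers,
        "subunits": 5 * pentamers + 6 * hexamers,   # always equals 60 * t
    }


print(triangulation_numbers(13))

predicted = quasi_equivalent_capsid(7)
observed_polyoma_subunits = 72 * 5    # 72 pentamers, as described in the text
print(predicted["capsomeres"], predicted["subunits"], observed_polyoma_subunits)
```

On this counting, 72 capsomeres correspond to T = 7 and hence to 420 subunits, whereas 72 pentamers supply only 360; the discrepancy is the apparent exception that the structural analysis cited above resolves.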
DOMAINS AND SEQUENCE MOTIFS
The three-dimensional structures of many proteins extant today suggest that they have evolved by permutation and combination of a relatively small number of domains or modules that became adapted to perform different functions in different settings (Bork, 1991; Branden & Tooze, 1991; Creighton, 1993). It is likely that there are perhaps only 10⁵ different proteins in the normal repertoire of nature. Based on the comparison of the recurrence of protein folding motifs in widely differing situations and our growing knowledge of the DNA sequences of the genomes of several different organisms, it can be estimated that there are perhaps only 10³ 'master' domains in the pool from which these structures are drawn (Chothia, 1992). The ability to recognise structural motifs in proteins is growing. Some of them are self-evident, given the high level of amino acid sequence similarity that they exhibit in different proteins (Branden & Tooze, 1991). In other instances, however, the relationship may have become obscured by a long period of divergent evolution, limited only by the need for certain amino acids to occupy particular sites in order to comply with a set of structural constraints that defines a particular fold. For example, the lipoyl domain of the E2 chain of 2-oxo acid dehydrogenase complexes (see above) and the biotinyl domain of biotin-dependent carboxylases exhibit only vestigial sequence similarity. Nonetheless, they have been predicted to have similar three-dimensional structures, given the conservation of certain amino acids in key positions (Brocklehurst & Perham, 1993). Likewise, a 30-residue sequence motif predicted (Hawkins et al., 1989) to participate in the binding of the cofactor thiamin diphosphate has been found to
adopt a common fold in the crystal structures of three different and unrelated enzymes that require it for catalysis (Muller et al., 1993). Folding domains are generally at their most obvious in multifunctional proteins, from which they can often be released by limited proteolysis (Bork, 1991; Perham, 1991). The biological function associated, in whole or part, with a given domain is then easy to assess, and the independence of the folding unit is self-evident. With many proteins, however, the domains have become intimately incorporated into the three-dimensional structure of the protein, making important contacts with other component parts of the overall structure. In such instances, the domain cannot be released by limited proteolysis, and the autonomy of the domain is inferred from the frequency of the occurrence of its characteristic folding topology in different proteins (Branden, 1990). A typical example of this is one of the first domains to be described: the dinucleotide-binding domain (Rossmann fold) found in most dehydrogenases (Rossmann et al., 1975; Wierenga et al., 1985). Its role in binding the coenzyme and dictating the specificity of the interaction is now well established (Scrutton et al., 1990; Baker et al., 1992; Mittl et al., 1993). In the homodimeric enzyme glutathione reductase, each subunit consists of four such well-delineated domains: an FAD-binding domain and an NADPH-binding domain (both Rossmann folds), followed by a smaller central domain and an interface domain that generates the major part of the subunit interface (Karplus & Schulz, 1987). This interface domain is stabilized by a central 5-fold β-pleated sheet, and in the intact enzyme is buried to a large extent by the other three domains (Fig. 2).
These contacts with other domains would leave behind unsatisfied hydrophobic areas on the surface of the interface domain if it were generated in an independent form, not by limited proteolysis but by expression in vivo of a suitable sub-gene encoding it. However, by identifying these contact regions, notably those with the FAD-binding domain and the NADPH-binding domain, and amending their hydrophobic nature by the suitable introduction of charged and hydrophilic amino acid residues (Fig. 3), it has proved possible to generate a soluble domain whose general folding appears undisturbed by the changes wrought in it, and which is still capable of specific dimerization (Leistler & Perham, 1994).
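The idea of amending a hydrophobic contact surface can be given a rough numerical feel with a hydropathy scale. The sketch below is an illustration only: it uses the Kyte-Doolittle scale, which is my choice and not a method named in the article, and it scores the seven residues listed in the caption to Fig. 3. The replacement residues are not stated in the text, so the substitution scored at the end (isoleucine to lysine) is purely hypothetical.

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KYTE_DOOLITTLE = {
    "Ala": 1.8, "Arg": -4.5, "Asn": -3.5, "Asp": -3.5, "Cys": 2.5,
    "Gln": -3.5, "Glu": -3.5, "Gly": -0.4, "His": -3.2, "Ile": 4.5,
    "Leu": 3.8, "Lys": -3.9, "Met": 1.9, "Phe": 2.8, "Pro": -1.6,
    "Ser": -0.8, "Thr": -0.7, "Trp": -0.9, "Tyr": -1.3, "Val": 4.2,
}

# Residues exchanged in the interface domain, from the Fig. 3 caption.
EXCHANGED_SITES = [
    ("Ile", 339), ("Ile", 349), ("Val", 343),                 # NADPH-domain contact
    ("Met", 378), ("Val", 382), ("Thr", 383), ("Thr", 384),   # FAD-domain contact
]


def total_hydropathy(sites):
    """Sum of Kyte-Doolittle values over (residue, position) pairs."""
    return sum(KYTE_DOOLITTLE[residue] for residue, _position in sites)


def hydropathy_change(original, replacement):
    """Change in hydropathy when one residue type replaces another."""
    return KYTE_DOOLITTLE[replacement] - KYTE_DOOLITTLE[original]


print(f"wild-type contact residues: {total_hydropathy(EXCHANGED_SITES):.1f}")
# HYPOTHETICAL substitution, for illustration only; not taken from the article.
print(f"Ile -> Lys: {hydropathy_change('Ile', 'Lys'):+.1f}")
```

A single scale of this kind is a crude guide at best; it ignores the structural context that the redesign described above had to respect.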
Fig. 2. Domain structure of Escherichia coli glutathione reductase, viewed perpendicular to the two-fold symmetry axis of the dimer. The interface domain is depicted in black, the FAD-binding domain is depicted in white, the NADPH-binding domain is depicted in lighter shading, and the central domain is depicted in darker shading. The sites of the amino acid exchanges required to create the interface domain are indicated by asterisks. The structure was generated by the program MOLSCRIPT (Kraulis, 1991).
This establishes the autonomous folding of the protein domain and provides direct experimental proof of the possibility of protein evolution by accretion of individual domains, perhaps by a process of gene fusion and exon shuffling (Dorit et al., 1990; Doolittle, 1991). Interactions between domains would then follow with the evolutionary acquisition of additional hydrophobic surface, a reversal of the process of directed mutagenesis used to create the soluble and tractable interface domain from glutathione reductase (Leistler & Perham, 1994). From the practical point of view, by following a similar procedure it should be possible to create soluble forms of other folding domains from globular proteins. These would be smaller than their parent proteins and more amenable to some experimental purposes, such as the design and redesign of proteins at the level of individual folding units.
CONCLUSIONS
Our ability to manipulate the structure of proteins in rational ways is growing by leaps and bounds. Our knowledge of the mechanism of protein folding is lagging, but it too is rapidly increasing (Creighton, 1992). If we add to this the encouraging advances being made with the attempts to design simple proteins ab initio (see, for example, Anthony-Cahill et al., 1992), it is clear that we now have in our hands the essential tools for creating novel biomolecular structures for specific purposes. The future looks very promising.
ACKNOWLEDGEMENTS
I am grateful to many colleagues for their valuable contributions to the research carried out in my
Fig. 3. Structure of the dimeric interface domain of glutathione reductase, viewed as in Fig. 2. The two contributing domains are depicted in lighter and darker shadings. The seven amino acid residues exchanged (three (Ile339, Ile349 and Val343) at the contact site with the NADPH-binding domain, and four (Met378, Val382, Thr383 and Thr384) at the contact site with the FAD-binding domain) are indicated in black. The structure was generated by the program MOLSCRIPT (Kraulis, 1991).
laboratory and to the Science and Engineering Research Council and The Wellcome Trust for their financial support. I thank Dr Y.N. Kalia for creating the MOLSCRIPT diagrams and Mr C. Fuller for skilled technical assistance.
REFERENCES
Anthony-Cahill, S.J., Benfield, P.A., Fairman, R., Wasserman, Z.R., Brenner, S.L., Stafford, W.F., III, Altenbach, C., Hubbell, W.L. & DeGrado, W.F. (1992). Molecular characterization of helix-loop-helix peptides. Science, 255, 979-983.
Baker, P.J., Britton, K.L., Rice, D.W., Rob, A. & Stillman, T.J. (1992). Structural consequences of sequence patterns in the fingerprint region of the nucleotide binding fold. Implications for nucleotide specificity. J. Mol. Biol., 228, 662-671.
Bork, P. (1991). Shuffled domains in extracellular proteins. FEBS Lett., 286, 47-54.
Branden, C.I. (1990). Founding fathers and families. Nature, 346, 607-608.
Branden, C.I. & Tooze, J.E. (1991). Introduction to Protein Structure. Garland, New York.
Brocklehurst, S.M. & Perham, R.N. (1993). Prediction of the three-dimensional structures of the biotinylated domain from the pyruvate carboxylase of yeast and of the lipoylated H-protein from the glycine cleavage system of pea leaf: a new, automated method for the prediction of protein tertiary structures. Protein Sci., 2, 626-639.
Butler, P.J.G. (1984). The current picture of the structure and assembly of tobacco mosaic virus. J. Gen. Virol., 65, 253-279.
Caspar, D.L.D. & Klug, A. (1962). Physical principles in the construction of regular viruses. Cold Spring Harbor Symp. Quant. Biol., 27, 1-24.
Chothia, C. (1992). One thousand families for the molecular biologist. Nature, 357, 543-544.
Creighton, T.E. (ed.) (1992). Protein Folding. Freeman, New York.
Creighton, T.E. (1993). Proteins. Structures and Molecular Properties. 2nd Edition. Freeman, New York.
Doolittle, R.F. (1991). Counting and discounting the universe of exons. Science, 253, 677-679.
Dorit, R.L., Schoenbach, L. & Gilbert, W. (1990). How big is the universe of exons? Science, 250, 1377-1382.
Greenwood, J., Hunter, G.J. & Perham, R.N. (1991). Regulation of filamentous bacteriophage length by modification of electrostatic interactions between coat protein and DNA. J. Mol. Biol., 217, 223-227.
Hawkins, C.F., Borges, A. & Perham, R.N. (1989). A common structural motif in thiamin pyrophosphate-binding enzymes. FEBS Lett., 255, 77-82.
Holmes, K.C. (1982). The structure and assembly of simple viruses. In Structural Molecular Biology, Eds. D.B. Davies, W. Saenger & S.S. Danyluk. Plenum, New York, pp. 475-505.
Hunter, G.J., Rowitch, D.H. & Perham, R.N. (1987). Interactions between DNA and coat protein in the structure and assembly of filamentous bacteriophage fd. Nature, 327, 252-255.
Karplus, P.A. & Schulz, G.E. (1987). Refined structure of glutathione reductase at 1.54 Å resolution. J. Mol. Biol., 195, 701-729.
Katsura, I. (1987). Determination of bacteriophage λ tail length by a protein ruler. Nature, 327, 73-75.
Ke, H., Lipscomb, W.N., Cho, Y. & Honzatko, R.B. (1988). Complex of N-phosphonoacetyl-L-aspartate with aspartate carbamoyl transferase. J. Mol. Biol., 204, 725-747.
Kraulis, P.J. (1991). MOLSCRIPT: A program to produce both detailed and schematic plots of protein structures. J. Appl. Crystallogr., 24, 946-950.
Labeit, S., Gibson, T., Lakey, A., Leonard, K., Zeviani, M., Knight, P., Wardale, J. & Trinick, J. (1991). Evidence that nebulin is a protein-ruler in muscle thin filaments. FEBS Lett., 282, 313-316.
Leistler, B. & Perham, R.N. (1994). Solubilizing buried domains of proteins: a soluble interface domain from glutathione reductase. Biochemistry, 33, 2773-2781.
Liddington, R.C., Yan, Y., Moulai, J., Sahli, R., Benjamin, T.L. & Harrison, S.C. (1991). Structure of simian virus 40 at 3.8 Å resolution. Nature, 354, 278-284.
Marvin, D.A., Hale, R.D., Nave, C. & Helmer-Citterich, M. (1994). Molecular models and structural comparisons of native and mutant class I filamentous bacteriophages Ff (fd, f1, M13), If1 and IKe. J. Mol. Biol., 235, 260-286.
Mattevi, A., Obmolova, G., Kalk, K.H., Westphal, A.H., de Kok, A. & Hol, W.G.J. (1993). Refined crystal structure of the catalytic domain of dihydrolipoyl transacetylase (E2p) from Azotobacter vinelandii at 2.6 Å resolution. J. Mol. Biol., 230, 1183-1199.
McKenna, R., Xia, D., Willingmann, P., Ilag, L.L., Krishnaswamy, S., Rossmann, M.G., Olson, N.H., Baker, T.S. & Incardona, N.L. (1992). Atomic structure of single-stranded DNA bacteriophage φX174 and its functional implications. Nature, 355, 137-143.
Mittl, P.R.E., Berry, A., Scrutton, N.S., Perham, R.N. & Schulz, G.E. (1993). Structural differences between wild-type NADP-dependent glutathione reductase from Escherichia coli and a redesigned NAD-dependent mutant. J. Mol. Biol., 231, 191-195.
Model, P. & Russel, M. (1988). Filamentous bacteriophage. In The Bacteriophages, Vol. 2, Ed. R. Calendar. Plenum, New York, pp. 375-456.
Muller, Y.A., Lindqvist, Y., Furey, W., Schulz, G.E., Jordan, F. & Schneider, G. (1993). A thiamin diphosphate binding fold revealed by comparison of the crystal structures of transketolase, pyruvate oxidase and pyruvate decarboxylase. Structure, 1, 95-103.
Perham, R.N. (1991). Domains, motifs and linkers in 2-oxo acid dehydrogenase multienzyme complexes: a paradigm in the design of a multifunctional protein. Biochemistry, 30, 8501-8512.
Reed, L.J. & Hackert, M.L. (1990). Structure-function relationships in dihydrolipoamide acyltransferases. J. Biol. Chem., 265, 8971-8974.
Rossmann, M.G., Liljas, A., Branden, C.I. & Banaszak, L.J. (1975). Evolutionary and structural relationships among dehydrogenases. The Enzymes, 11, 61-102.
Russell, G.C. & Guest, J.R. (1991). Sequence similarities within the family of dihydrolipoamide acyltransferases and discovery of a previously unidentified fungal enzyme. Biochim. Biophys. Acta, 1076, 225-232.
Scrutton, N.S., Berry, A. & Perham, R.N. (1990). Redesign of the coenzyme specificity of a dehydrogenase by protein engineering. Nature, 343, 38-43.
Tsao, J., Chapman, M.S., Agbandje, M., Keller, W., Smith, K., Wu, H., Luo, M., Smith, T.J., Rossmann, M.G., Compans, R.W. & Parrish, C.R. (1991). The three dimensional structure of canine parvovirus and its functional implications. Science, 251, 1456-1464.
Wierenga, R.K., De Maeyer, M.C.H. & Hol, W.G.J. (1985). Interaction of pyrophosphate moieties with α-helices in dinucleotide binding proteins. Biochemistry, 24, 1346-1357.