Sequences and topology: an inverse approach to the old folding problem
Editorial overview
Tom L. Blundell and Russell F. Doolittle
Birkbeck College, London, UK and University of California, San Diego, California, USA
Current Opinion in Structural Biology 1992, 2:381-383
Predicting the topology of a protein from its sequence is rather like authenticating the Holy Shroud. The sequence contains the imprint of the protein fold, but is it a helix bundle or an antiparallel strand? The Holy Shroud has the imprint of human form, but is it consistent with the crucifixion? Both can be considered projections of three-dimensional forms; the shroud has the two dimensions of a sheet, the sequence the one dimension of the polypeptide chain. Further, both are the subjects of intense discussion and religious fervour; they are constantly subjected to new knowledge and to new methodologies. The reviews that follow provide ample evidence that the relationship between protein topology and sequence is the more challenging of these two much debated problems.
So how has all this helped the study of protein sequences and topology? It has done so by making it clear that most people have been asking the wrong question. We have asked: if we know the sequence, can we predict the topology? The answer is still an unequivocal 'no'. Nevertheless, the comparative studies described in these reviews and elsewhere show that we might solve the 'inverse folding problem'. This poses the question: if we know the three-dimensional structure of one protein, can we predict the sequences that might adopt a similar fold? The answer is a qualified 'yes'. Our confidence in the affirmative comes from understanding restraints on sequence diversity in families of divergently evolved proteins. The qualification comes from debate about what constitutes a common fold.
But there has been progress. It has come not from the heavy calculations of physics, but rather from the study of families of divergent proteins such as those described in the reviews of Waterman (pp 384-387) on cytochrome P450 and Welinder (pp 388-393) on peroxidases in this section. These demonstrate that families of homologous proteins may have very few sequence identities even though they adopt the same tertiary fold. This theme is taken up by Overington (pp 394-401), who reviews the remarkable similarity between hexokinase, actin and the 44 K heat-shock cognate protein, which share percentage sequence identities of less than 15%. The many insertions and deletions that change the surface topography leave the core essentially the same in both molecules. There is strong evidence that the few sequence identities that are not functionally important contribute to the common core. Restraints on the sequences, such as side-chain packing between key elements of secondary structure or the adoption of unusual main-chain torsion angles, are reflected in the conservative variation of sequences during evolution. Very similar observations can be made when the aspartic proteinases from cellular (pepsin-like) and retroviral origins are compared. In the aspartic proteinases, there is evidence for evolutionary restraints on polar side chains that are both inaccessible to solvent and buried to main-chain functions, such as the conservatively varied (threonine or serine) central residue of the characteristic Asp-Thr-Gly sequence.
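The 'percentage sequence identity' quoted above can be made concrete with a minimal sketch. It assumes two sequences that have already been aligned (gaps written as '-') and counts identities only over positions where both sequences have a residue; other denominators (alignment length, shorter sequence length) are also in use and give somewhat different figures. The function name and the toy sequences are illustrative only and are not taken from the reviews.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percentage of identical residues over positions aligned without a gap.

    Assumes the two strings come from the same pairwise alignment and
    therefore have equal length.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")

    compared = 0   # positions where both sequences have a residue
    identical = 0  # positions where those residues are the same
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # insertions and deletions are not counted
        compared += 1
        if a.upper() == b.upper():
            identical += 1

    if compared == 0:
        return 0.0
    return 100.0 * identical / compared


# Toy alignment (hypothetical sequences): 6 identities over 9 compared positions.
toy_identity = percent_identity("MKT-AYIAKQR", "MRTLAY-SKER")
```

On real families such as those discussed here, the value depends strongly on the alignment itself, which is why structure-based alignments are usually preferred when identities fall this low.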
There have been several published approaches to the inverse folding problem. Ponder and Richards [1] used a library of side-chain rotamers and sought to find combinations of sequence and side-chain conformation that would allow retention of a known three-dimensional structure. The method of Eisenberg and coworkers [2,3] considered the probability of finding an amino acid in a particular local environment, and used this to generate a profile of sequence variation. Johnson and coworkers [4,5] have used expanded amino acid substitution tables that take into account the local environment in the tertiary structure. Yet another approach involves threading a sequence through a known structure (D Jones, WR Taylor and JM Thornton, personal communication), and asks a slightly different but related question: can the sequence of interest adopt a particular three-dimensional fold? Each of these methods aims to detect distantly related sequences that adopt a particular protein fold. If this can be done, then the more traditional approaches of comparative modelling can be used to provide a model of the protein of interest. Of course, the power of the inverse approach is dependent on what percentage of protein folds adopted by living organisms have a member whose structure has been experimentally defined. This can never be known for sure, especially as X-ray analysis and NMR may not be powerful methods for the definition of certain types
Abbreviation
SH: Src homology.
© Current Biology Ltd ISSN 0959-440X
of structures. These are likely to include both membrane and other proteins that are marginally stable or unstable in aqueous solvents, and those that are either very flexible or very large. Most protein biochemists would probably agree that this is less serious than we originally thought; the number of folds adopted by integral membrane proteins is severely limited by the inability of lipids to make hydrogen bonds to peptide main-chain functions, and large and flexible proteins tend to be made from a mosaic of smaller modules that are amenable to study (see Bork, this issue, pp 413-421). Even problematic orphan sequences, assigned to 'temporary one-member bunches' or TOMBS, are frequently adopted by distant, but still very real, relatives. Most experts now agree that we have sampled at least 50% of protein folds adopted in living organisms, which means that we should be able to predict the three-dimensional structures of half the sequences presented to us!
Prediction of protein structure where there is no homologue (or even a convergently evolved analogue) can also be assisted by the study of divergent evolution of protein families. Benner (pp 402-412) describes a method developed at Zürich which allows one to predict the fold of a protein family, in the absence of a homologous three-dimensional structure, where the sequences of at least four members of the family are known. Indeed, the method is so straightforward that members of Benner's undergraduate classes have been successfully solving structures under examination conditions while other protein experts have tried in vain. How do they do it? The first step is to get a good alignment of the sequences. The second is to assume that proteins have evolved to be unstable and that this source of instability is passed around the sequence during evolutionary divergence.
Although the Benner approach does not improve very much on classical secondary structure prediction when measured by percentage correctness of all secondary structure elements, it does much better at predicting core elements, i.e. those that are most important to the three-dimensional fold. It certainly was effective in predicting the structure of the protein kinases. Protein kinases constitute just one of many types of modules that play an important part in the recently evolved, mosaic proteins (Bork, pp 413-421). There are now numerous examples of such modules among the modern proteins involved in signal transduction, differentiation, cell-cell communication, defence mechanisms and other important features of complex organisms. They are mobile units, which often correspond to exons and are shuffled both within and between molecules. As Bork points out, combining subfunctions is much more efficient than conventional modification of already existing proteins by subsequent mutations. Many mobile modules appear to be flexibly linked together in mosaic proteins. This has certainly been a surprise to many of us. The flexibility of the linker regions almost certainly means that information does not flow from module to module via conformational changes and allosteric interactions, as was first thought. Indeed, many complex proteins seem to be controlled by 'association'. This has been the reluctant conclusion of many seeking to understand the function of antibodies, which are now known to have contained duplicated modules of the immunoglobulin fold ever since the development of the lower vertebrates (see Hsu and Steiner, pp 422-431). Antibody action in activating complement seems to involve association of complement with several antibodies bound to antigen. A similar mechanism seems to occur in hormone and growth factor actions mediated by receptor tyrosyl kinases: the ligand induces dimerization of the receptor, which then allows autophosphorylation by the kinase. As Pawson (pp 432-437) demonstrates, the next stage of the transduction of the signal also involves functional modules and complex associations between proteins. The activated receptor tyrosyl kinases are recognized by Src homology (SH) region 2 domains, which bind specific sequences that contain a phosphorylated tyrosine residue. SH2 domains are important modules in the complex two-chain molecule phosphatidylinositol 3'-kinase, which mediates the next stage of the signal transduction. Alignment of SH2 domains indicates the presence of several conserved motifs interspersed with variable regions. B Bax (personal communication) finds evidence in the sequence alignments for a helix followed by four antiparallel strands and a further helix. The invariant arginine within the sequence Phe-Leu-Val-Arg-Glu-Ser and the conservatively varied arginine and histidine in other conserved motifs are candidates for phosphotyrosine binding and form simple sequence motifs that characterize these domains. Although simple motifs have played an important part in developing our understanding of the complex structures of mosaic proteins, the history of the prediction of protein kinases has been much upset by the simple motif Gly-X-Gly-X-X-Gly (where X represents any amino acid), once thought to be characteristic of the dehydrogenases (as Bork outlines in his review).
It is a moral story akin to that which described the brownish-red stains on the Holy Shroud as evidence of the nails that pinned Jesus to the cross. Clearly nature, like the religiously fervent, can play intriguing hoaxes on us. The crystal structure shows, as did Benner before it was determined, that this motif occurs on an antiparallel β-structure. In general, this should be no surprise: it is well known that simple sequence motifs have a high probability of occurring by chance, and they only have predictive value when there is firm evidence of the local elements of secondary structure.
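The point about chance occurrence can be put in rough numbers. The sketch below is an illustration under stated assumptions, not a method from the reviews: it scans a sequence for Gly-X-Gly-X-X-Gly with a regular expression, and estimates how many matches a random sequence would be expected to contain if residues were independent and glycine had a fixed frequency. The glycine frequency of 0.07 is an assumed round figure, and the function names are hypothetical.

```python
import re

# Gly-X-Gly-X-X-Gly in one-letter code. The lookahead lets overlapping
# matches be reported, since finditer otherwise skips past each match.
GLY_MOTIF = re.compile(r"(?=(G.G..G))")


def find_motif(sequence: str) -> list[int]:
    """Return 0-based start positions of every Gly-X-Gly-X-X-Gly match."""
    return [m.start() for m in GLY_MOTIF.finditer(sequence.upper())]


def expected_chance_hits(length: int, gly_freq: float,
                         motif_len: int = 6, fixed_positions: int = 3) -> float:
    """Expected number of matches in a random sequence of the given length.

    Assumes independent residues, so each window matches with probability
    gly_freq ** fixed_positions; the X positions match anything.
    """
    windows = max(length - motif_len + 1, 0)
    return windows * gly_freq ** fixed_positions


# Assumed values: a 300-residue protein and a glycine frequency of 0.07.
per_protein = expected_chance_hits(300, 0.07)  # about 0.1 matches per protein
per_database = 10_000 * per_protein            # about 1000 in 10,000 such proteins
```

Under these assumptions roughly one random 300-residue sequence in ten carries the motif, which is why a match alone says little without independent evidence for the local secondary structure.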
References and recommended reading
Papers of particular interest, published within the annual period of review, have been highlighted as:
• of special interest
•• of outstanding interest
1. PONDER JW, RICHARDS FM: Tertiary Templates For Proteins. J Mol Biol 1987, 193:775-791.
2. LÜTHY R, McLACHLAN A, EISENBERG D: Secondary Structure-based Profiles: Use of Structure-conserving Scoring Tables in Searching Protein Sequence Databases for Structural Similarities. Proteins 1991, 10:229-239.
3. BOWIE JU, LÜTHY R, EISENBERG D: A Method to Identify Protein Sequences that Fold into a Known Three-dimensional Structure. Science 1991, 253:164-170.
4. OVERINGTON JP, JOHNSON MS, ŠALI A, BLUNDELL TL: Tertiary Structural Constraints on Evolutionary Diversity: Templates, Key Residues and Structure Prediction. Proc R Soc Lond [B] 1990, 241:146-152.
5. OVERINGTON JP, DONNELLY D, ŠALI A, JOHNSON MS, BLUNDELL TL: Environment Specific Amino Acid Substitution Tables: Tertiary Templates and Prediction of Protein Folds. Protein Sci 1992, 1:216-226.
TL Blundell, Imperial Cancer Research Fund Unit, Birkbeck College, Malet Street, London WC1E 7HX and Agricultural and Food Research Council, Central Office, Swindon SN2 1UH, UK.
RF Doolittle, Center for Molecular Genetics M-034, University of California, San Diego, La Jolla, California 92093, USA.