Protein fold irregularities that hinder sequence analysis
Robert B Russell* and Chris P Ponting†

The detection of homologous protein sequences frequently provides useful predictions of function and structure. Methods for homology searching have continued to improve, such that very distant evolutionary relationships can now be detected. Little attention has been paid, however, to the problems of detecting homology when domains are inserted or permuted. Here we review recent occurrences of these phenomena and discuss methods that permit their detection.

Addresses
*SmithKline Beecham Pharmaceuticals, Bioinformatics, New Frontiers Science Park (North), Third Avenue, Harlow, Essex CM19 5AW, UK; e-mail: russelrl@mh.uk.sbphrd.com
†University of Oxford, Oxford Centre for Molecular Sciences, The Old Observatory, South Parks Road, Oxford OX1 3RH, UK; e-mail: [email protected]

Current Opinion in Structural Biology 1998, 8:364-371
http://biomednet.com/elecref/0959440X00800364
© Current Biology Ltd ISSN 0959-440X

Abbreviations
Btk     Bruton's tyrosine kinase
CBD     cellulose-binding domain
Dbl     diffuse B cell lymphoma
DH      Dbl homology
eIF-2α  eukaryotic initiation factor-2α
FAD     flavin adenine dinucleotide
FMN     flavin mononucleotide
HisRS   histidyl-tRNA synthetase
KH      K homology
NK      natural killer
PCI1    polypeptide chymotrypsin inhibitor-1
PH      pleckstrin homology
ROKα    Rho-associated kinase α
SH      Src homology
TMEGF5  thrombomodulin epidermal growth factor-like domain 5
Introduction
The detection of sequence similarity can be of great use during characterisation of new proteins. Proteins showing significant sequence similarity have a common evolutionary ancestor (i.e. they are homologous [1]), a common structure and, frequently, a common function. Alignments of homologous protein sequences can be used in sensitive database searches for additional distant homologues (e.g. [2]).

The concept of protein domains is critical to all analyses of sequences. The division of proteins into their component globular substructures generally partitions structural regions with individual functions. Multidomain proteins can comprise several different domains and/or collections of homologous domains that are often arranged in tandem with up to several dozen adjacent repeats [3].

Many popular methods for the detection of similar protein sequences make general assumptions about the nature of domain structures. First, they assume that the optimal alignment of two homologous protein domains will occur between sequences of similar lengths, with only short regions of insertions or deletions. Second, they assume that the optimal alignment of two sequences will represent equivalent numbers of secondary structures in an identical sequence order in both proteins. Third, although not an assumption made by sequence comparison algorithms, it is also generally assumed that domains from multidomain proteins form distinct and compact structures, with interdomain interactions mediated mainly by surface-exposed residues.

Over recent years, evidence has accumulated to show that structures of clearly homologous domains can differ in ways that challenge both the accurate alignment of multiple sequences and the detection of homologues by database searching. Analyses of protein sequence and three-dimensional structure have found fold irregularities that provide exceptions to each of the above assumptions. Domain homologues are now known to differ as a result of large insertions [4], circular permutations [5] and secondary structure exchange [6] (Figure 1). Here, recent examples of these phenomena are reviewed, and improvements to the sequence analysis tools that could improve their detection are considered.
Domain insertions and extensions
Protein structure determination has revealed several examples of the insertion of one domain (the inserted domain) into another (the parent domain); sometimes sequence or structural similarities with other domains demonstrate that homologues of inserted and parent domains can also exist independently [4]. A previously documented example of domain insertion is that of the 'fingers' inserted into the
Figure 1. A representation of the evolutionary hazards for domain structures: insertion, permutation and secondary structure exchange. The tandem repeat structure at the top of the figure (the shading denotes a second repeat) is modified (arrows) by domain insertion, permutation and secondary structure exchange. Sets of adjacent blocks in the order a, b, c, and d along the horizontal represent a compact globular domain. The black circle represents an unrelated domain. (Panel labels: Insertion; Permutation; Secondary structure exchange. Current Opinion in Structural Biology.)
larger than the catalytic domain into which it is inserted [12•]. Crystallography [13] and sequence analysis [14] predict the presence of a large insertion (442 amino acids) within a TIM barrel in phospholipase Cγ; this insertion contains three Src homology domains (SH2-SH2-SH3) that are themselves embedded within a pleckstrin homology (PH) domain.

Several other PH domains are involved in domain insertion events. PH domains accommodate single insertions of PDZ (in syntrophins [15]) and C1 (in Rho-associated kinase α, ROKα [16]) domains. A single PH domain is inserted into a band 4.1 domain in Mig2 [14]. Myosin X contains three adjacent PH domains: the second domain is inserted into the first, and the C-terminal PH domain is uninterrupted (DP Corey, BH Derfler, CK Solc, GM Duyk, RE Cheney, personal communication; EMBL accession code U55042).

A major goal of multiple alignment methods is to predict the amino acid limits of domains and thus enable their functional and structural characterisation. Different subsets of homologous domains may, however, differ in their domain limits. For example, SOS1-like PH domains, immediately C-terminal to Dbl (diffuse B cell lymphoma) homology (DH) domains, may require an additional N-terminal α helix for structure and function [17•]. Bruton's tyrosine kinase (Btk)-like PH domains contain C-terminal extensions possessing the zinc-binding motif that is critical for Btk function [18••]. Similarly, subsets of ubiquitin-conjugating enzyme homologues [19•,20•] and K homology (KH) domains [21] are C-terminally truncated with respect to known structures.

Circular permutations
Circular or cyclic permutations relate domains that are similar in structure and/or sequence, but whose N-terminal and C-terminal portions have been exchanged. Conceptually, this is equivalent to ligating a domain's N and C termini and cutting the polypeptide chain elsewhere. Nature has realised exactly this concept: concanavalin A is circularly permuted relative to its homologue favin [22] as a result of post-translational cleavage and ligation reactions rather than through permutation of one of the genes [23]. Permutation may appear to represent a drastic alteration to a protein fold, although experiments have shown that synthetic permuted variants of many proteins exhibit few differences in their folding or activity (reviewed in [24,25]).

In contrast to the concanavalin A-favin example, several clear instances of circular permutation occurring at the level of the gene have been reported recently (see [5] for a review), including permuted versions of saposin domains. The prediction that permuted saposin homologues ('swaposins') can be accommodated by the saposin fold as a result of both the close proximity of its N and C termini and the exposed position of the loop at the site of permutation [26] is in agreement with the recently determined structure of natural killer (NK)-lysin [27•], a saposin homologue.

Swaposins were first identified [26] as domain inserts in plant aspartic proteinases [28]. A similar occurrence of an inserted domain that is also permuted is evident in XynC, a Prevotella ruminicola xylanase. The authors of the original sequence paper correctly identified the 160 amino acid insertion within the XynC catalytic domain [29•], although its permutation and low sequence identity relative to the family of cellulose-binding domains (CBDs) hindered its identification as a CBD homologue. Assignment as a homologue was evident using iterative database searching methods [2] and a comparison of the resulting multiple alignment with the insertion sequence (C Ponting, unpublished data).

Recently, Murzin [30••] proposed the existence of homology through permutation between the FMN-binding protein from Desulfovibrio vulgaris and a domain from the ferredoxin reductase superfamily.
Originally proposed to be an ancient
Sequences and topology
relative of trypsin-like serine proteinases [31], structural alignment following permutation suggests, however, that the FMN-binding protein more closely resembles a domain from ferredoxin reductase in terms of structure and the nature and location of bound ligands. FMN binds in an equivalent location to FAD binding to the reductases, and the superimposition, including a six-stranded β barrel and an α helix, is striking, providing strong evidence that these domains are homologues through permutation.

An obvious means of generating permutations involves the intragenic relocation of a gene segment, possibly via transposable elements. A more likely mechanism, however, is suggested from the frequent occurrence of tandem repeats in genes: a permuted protein can arise simply by extraction (i.e. duplication) of the C-terminal portion of one repeat together with the N-terminal portion of the subsequent repeat [26]. Since a permutation event effectively involves the fusion of the protein's N and C termini, it is often noted that the close proximity of N and C termini is a requirement for naturally-occurring permutations. Saposins and CBD homologues are good candidates for this mechanism of permutation, since both are found frequently as tandem repeats and their known structures [27•,32] demonstrate the close proximities of their N and C termini.

Due to difficulties in detection (see below), the abundance of permutations in current sequence databases is unknown. A search for permutations by comparing the sequences of domains with known three-dimensional structures [33] with current databases (RB Russell, AJ Whittaker, CP Ponting, unpublished data) revealed only a single example previously unknown to the authors: the permutation of potato polypeptide chymotrypsin inhibitor-1 (PCI1) relative to other homologous plant protease inhibitors.
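The tandem-repeat mechanism described above can be illustrated with a minimal Python sketch. It treats a domain as a string of blocks (a, b, c, d, as in Figure 1) and shows that any circular permutation of the domain can be read directly from two adjacent copies of it. The function names `rotate` and `is_circular_permutation` are hypothetical, and exact string matching is an assumption made purely for illustration; real homologues diverge in sequence and would need alignment rather than substring tests.

```python
def rotate(seq: str, cut: int) -> str:
    """Circularly permute seq: join its termini and cut before position `cut`."""
    if not seq:
        return seq
    cut %= len(seq)
    return seq[cut:] + seq[:cut]


def is_circular_permutation(a: str, b: str) -> bool:
    """True if b can be read from a tandem duplication of a (exact match only)."""
    return len(a) == len(b) and b in a + a


domain = "abcd"                 # blocks a, b, c, d of one compact domain
tandem = domain + domain        # two adjacent repeats: abcdabcd

# C-terminal half of repeat 1 followed by the N-terminal half of repeat 2.
permuted = tandem[2:2 + len(domain)]

print(permuted)                                   # cdab
print(permuted == rotate(domain, 2))              # True
print(is_circular_permutation(domain, permuted))  # True
print(is_circular_permutation(domain, "acbd"))    # False
```

The last call shows that a shuffled block order that is not a rotation is rejected, which is the distinction between a circular permutation and an arbitrary rearrangement.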
The known structure of PCI1 bound to Streptomyces griseus proteinase B [34] shows that it is a small disulphide-rich protein, with nonadjacent N and C termini. The authors noted that the sequence of the PCI1 structural domain bridged across two repeated domains of a second potato proteinase inhibitor (inhibitor II). Thus, the PCI1 structure and the repeats found in inhibitor II are related by a circular permutation. The authors built a tertiary structure model of the two sequence repeats of inhibitor II based upon their PCI1 structure. They were faced with the problem that each of the two repeats is permuted with respect to the PCI1 sequence: the region from the middle of repeat 1 to the middle of repeat 2 can be aligned with PCI1, as can the sequence of the C-terminal half of repeat 2 fused with the N-terminal half of repeat 1 (Figure 2). Thus they proposed a second PCI1-like structure that would lie adjacent to the central domain of inhibitor II in three dimensions and would be permuted with respect to PCI1.

More recently, other plant proteinase inhibitor sequences have been identified. These contain between two and six repeats [35], each similar either to PCI1 or to its permuted version (Figure 2). We suggest that the permutations within the repeat structure are tolerated by proximity of the multidomain polypeptide's N and C termini. In this model, the PCI1 permuted repeats located in the middle of the sequence fold into PCI1-like folds, while the remaining C- and N-terminal fragments meet and fold to a permuted variant of PCI1. Figure 3 shows a hypothetical model for an inhibitor containing three such repeats. The observed sequence repeat does not correspond to a structural unit, and is unlikely to form a folded domain in isolation because the permuted structure (as seen in PCI1) has nonadjacent N and C termini. The permutation of the PCI1 structure is only tolerated when intervening domains can bridge the original (and nonproximal) N and C termini. The proposed staggered arrangement of structural repeats relative to sequence repeats is reminiscent of that observed in the β subunit of G proteins [36].

Secondary structure exchange
The structures of monomeric, multidomain proteins need not mirror those of their equivalent homodimers.
Figure 2. Multiple alignment of homologues of the potato protease inhibitor PCI1, represented using the program Boxshade (K Hofmann, M Baron, unpublished data) and default parameters. (Alignment rows: 4SGB-I, 1TIH-I, PCI2/NICAL-1 to -5 and PIN2/CAPAN-1 to -3.) Numbers represent amino acid numbers and dashes represent insertion/deletion positions. PCI1 (PDB code 4SGB-I) and a further homologue of known structure (1TIH-I) are shown as permutations of sequence repeats found in five copies of a precursor protease inhibitor from Nicotiana alata (PCI2/NICAL, GenBank identifier 542045; 397 residues) and two of three copies in the precursor inhibitor from Capsicum annuum (PIN2/CAPAN, GenBank identifier 2745898; 204 residues). (Current Opinion in Structural Biology.)
structures of homologous domains include rare occurrences of the nonconservation of disulphide-bonding patterns [50,51,52•].

Figure 3. MOLSCRIPT [70] picture showing a hypothetical model of a three domain precursor that is homologous to PCI1. The three sequence repeats that are permuted with respect to PCI1 are shaded light grey, grey and dark grey. Dashed boxes distinguish structural repeats from sequence repeats. (Labels: Structural repeat (permuted); Sequence repeat; PCI1 C terminus. Current Opinion in Structural Biology.)
As discussed elsewhere [3,37], residue contacts in the intramolecular domain-domain interface within the monomer can be reproduced in a dimer by exchange of domains. This results in domain-domain orientations for the two monomers in the dimeric structure that differ from the domain-domain orientation apparent for the monomeric structure.
Hint modules, inteins and exteins: duplications, insertions, exchanges, and permutations
Inteins (internal proteins) are protein-splicing elements that are inserted into exteins (external proteins) within nascent polypeptides. Precursor proteins undergo autosplicing and ligation to form intein and extein protein products [53,54••]. The majority of known inteins contain two structural domains that possess either protein splicing or endonuclease functions [55••]. The evolution of inteins is proposed to have occurred following gene fusion of protein splicing and endonuclease DNA, resulting in a mobile insertional element [55••]. Mycobacterium xenopi gyrase A is a rare example of an intein that only contains the splicing domain [56••]. The splicing domain is a member of a family of homologues that includes the C-terminal autoprocessing domain of the secreted signalling protein Hedgehog [57••].

The recently solved crystal structure of the Hedgehog autoprocessing domain [57••] provides an example of a protein domain that demonstrates all of the fold irregularities described above. The autoprocessing domain consists of two lobes with similar structures that are presumed to be the result of an ancient gene duplication. These two lobes pack together tightly and exchange a single β strand. A structure-based sequence alignment reveals that this is equivalent to the permutation of one subdomain with respect to a second, and insertion of one subdomain into the other [57••]. This suggests a second mechanism for generating gene permutations: secondary-structure exchange between tandem domain homologues in a single polypeptide, followed by truncation of the N- and C-terminal sequences, and resulting in a core, permuted domain.
Similarly, the tertiary structure of single domain monomers may differ from those in their dimers. In these cases, residue contacts forming the hydrophobic interior of a single domain can be reproduced in the dimer by the exchange of one or more secondary structures. The result is two structures that are each essentially identical to the monomer, except that each contains structural contributions from the other chain. The exchange of single secondary structures has been observed for CD2 [38], quaternary forms of bovine seminal ribonuclease [39], ecotin [40] and human CksHs2 [41], and an engineered version of Staphylococcal nuclease [42]. Greater numbers of secondary structures are swapped in dimers of interleukins-γ, -5 and -10 [43-45], Eps8 SH3 [46•] and bovine odorant-binding protein [47,48].
The challenge for sequence comparison methods
Spatial rearrangements of secondary structures within monomeric structures of homologous proteins are rare and are related mostly to the conformational changes that regulate function, for example, the serpin family of serine protease inhibitors [49]. Other nonequivalences within
Neither interdomain secondary-structure exchange nor intradomain disulphide exchange is detectable by analysis of sequences alone. The detection of insertions and of permutations are tractable problems that are not, however, currently addressed explicitly by the widely used sequence comparison algorithms. Both are comparatively
Analysis of the wider family of Hint (Hedgehog-intein) homologues reveals other irregularities in structure. The domain family includes the protein-splicing domains of inteins as shown by sequence analysis ([57••] and references therein) and by structure determination [56••,57••] to be domain insertions within a variety of parent domain contexts. Hint modules themselves accommodate domain insertions in at least three independent sites (Figure 4).
Figure 4. MOLSCRIPT [70] figure of the tertiary structure of the hedgehog autoprocessing domain, a Hint module (PDB code 1AT0). The different shading on the structure shows how the repeat structure leads to β-strand exchange and an apparent permutation. The boxes represent the location and maximal length of insertions tolerated at three sites on the fold. Adapted with permission from [57••]. (Legible labels: DNA-recognition domain; 2-64 residue insertion. Current Opinion in Structural Biology.)
rare, although their existence warrants methods to aid their detection. Approaches for detecting insertions and permutations should be similar since alignments for both consist of aligned segments that are either separated by a long length of intervening sequence (insertions), or reversed in sequence order (permutations). This is shown schematically by way of dot plots in Figure 5.

The detection of insertion and permutation via homology is likely to succeed by one of two strategies: postprocessing of output from pre-existing packages; or modifications to existing search methods and the design of new algorithms.
The first approach suffers from an inherent problem in that insertion or permutation detection requires that two local alignments be detected with statistically significant alignment scores. If the homology is distant, then one or both of these segments (and thus the insertion or permutation) may be missed. Given that both segments score above a threshold, postprocessing of, for example, BLAST [58] output can be used to detect insertions or permutations. The same holds true for dynamic programming methods [59] that report more than one alignment between query and database sequence (e.g. [60-62], although notably not SSEARCH [63], which outputs only a single alignment). For inserted domains, it is also conceivable that modified affine gap penalties could permit very long insertions, although this is likely to produce false positives within domains occurring as tandem repeats (see below).
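The postprocessing strategy can be sketched as follows. This is a minimal illustration, not the procedure used by any of the cited programs: it assumes that a search tool has already reported two local alignments between a query and a database (subject) sequence, each with 1-based inclusive coordinates and a score. The `Segment` type, the `classify_pair` function and the thresholds `min_score` and `min_insert` are all hypothetical names and arbitrary values chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One local alignment: query and subject coordinates (1-based, inclusive)."""
    q_start: int
    q_end: int
    s_start: int
    s_end: int
    score: float


def classify_pair(a: Segment, b: Segment,
                  min_score: float = 50.0, min_insert: int = 40) -> str:
    """Interpret two local alignments between the same query and subject."""
    # Both segments must be detected, otherwise nothing can be inferred.
    if a.score < min_score or b.score < min_score:
        return "undetected"
    first, second = sorted((a, b), key=lambda seg: seg.q_start)
    if second.q_start <= first.q_end:
        return "ambiguous"          # segments overlap in the query
    # Later in the query but earlier in the subject: order is reversed.
    if second.s_end < first.s_start:
        return "permutation"
    q_gap = second.q_start - first.q_end - 1
    s_gap = second.s_start - first.s_end - 1
    if s_gap >= min_insert and q_gap < min_insert:
        return "insertion in subject"
    if q_gap >= min_insert and s_gap < min_insert:
        return "insertion in query"
    return "colinear"


inserted = classify_pair(Segment(1, 60, 1, 60, 80.0),
                         Segment(61, 120, 221, 280, 75.0))
permuted = classify_pair(Segment(1, 60, 61, 120, 80.0),
                         Segment(61, 120, 1, 60, 75.0))
missed = classify_pair(Segment(1, 60, 1, 60, 80.0),
                       Segment(61, 120, 221, 280, 30.0))

print(inserted)   # insertion in subject
print(permuted)   # permutation
print(missed)     # undetected
```

The third call reproduces the inherent problem noted above: when one of the two segments falls below the score threshold, the insertion is missed even though the other segment is found.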
Figure 5. Schematic dot plot representations of sequence similarities between a pair of homologues where one of the homologues contains (a) an inserted domain or is (b) permuted. Representation of the domains is as for Figure 1. The sequence of an unmodified domain runs horizontally (x-axis), whereas that of an inserted or permuted variant runs vertically (y-axis). Solid lines within the boxes represent regions of sequence similarity. (Current Opinion in Structural Biology.)
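The dot plot patterns described in the caption can be reproduced with a small Python sketch. It counts identical k-mers on each diagonal, where the diagonal offset is the position in the variant (y-axis) minus the position in the unmodified sequence (x-axis). The function name `diagonal_hits`, the toy 20-residue sequence, the linker used as an insert, and the word length `k=3` are all assumptions made for illustration; exact k-mer matching stands in for the scored similarity a real dot plot program would use.

```python
from collections import Counter


def diagonal_hits(x: str, y: str, k: int = 3) -> Counter:
    """Count identical k-mers per diagonal (offset = position in y - position in x)."""
    index = {}
    for i in range(len(x) - k + 1):
        index.setdefault(x[i:i + k], []).append(i)
    hits = Counter()
    for j in range(len(y) - k + 1):
        for i in index.get(y[j:j + k], ()):
            hits[j - i] += 1
    return hits


unmodified = "ACDEFGHIKLMNPQRSTVWY"                            # toy domain
with_insert = unmodified[:10] + "GGSGGSGGSG" + unmodified[10:]  # panel (a)
permuted = unmodified[12:] + unmodified[:12]                    # panel (b)

print(dict(sorted(diagonal_hits(unmodified, with_insert).items())))
# {0: 8, 10: 8}
print(dict(sorted(diagonal_hits(unmodified, permuted).items())))
# {-12: 6, 8: 10}
```

In this toy case the insertion gives two diagonals in the same order, displaced by the length of the insert, whereas the permutation gives one diagonal below and one above the main diagonal, with absolute offsets that sum to the domain length.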
Motif-based search protocols such as MoST [64] provide a strong potential for detecting domain insertions and permutations, since these do not assume the relative spacing or order of ungapped alignment blocks within detected homologues. The use of MoST for this purpose is, however, labour intensive, since alignment blocks have to be prepared and queried individually, and the results of queries have to be correlated. Unfortunately, an attempt to automate this process fully [65], in order to gain speed, assumes colinearity of alignment blocks, and
hence is applicable only to the detection of insertions and not permutations.

Modifications to existing algorithms might permit the direct detection of domain insertions or permutations. New versions of BLAST (WU-BLAST, W Gish, unpublished data; PSI-BLAST [66•]) combine locally aligned segments into a gapped alignment, but assume the colinearity and close proximity of aligned segments. Relaxing these criteria, by combining alignment statistics for segments that are either well separated or permuted in order, might permit detection of less obvious inserted domains or permutations. Another possibility is to develop sets of dynamic programming algorithms, possibly through the use of complex grammars (e.g. [67,68•]), although these problems are perhaps better solved for higher order (e.g. context free) grammars and therefore present considerable computational challenges (E Birney, personal communication).

Tandem repeats with low similarity are a potential source of false positives in any search for domain insertions or permutations (RB Russell, CP Ponting, unpublished data). For example, consider a situation where the match of a single domain protein to another containing a pair of repeats generates four alignments which, ranked by score, are:
1. The C terminus of the query with the C terminus of repeat 1.
2. The N terminus of the query with the N terminus of repeat 2.
3. The N terminus of the query with the N terminus of repeat 1.
4. The C terminus of the query with the C terminus of repeat 2.

In postprocessing, the distinction between two repeats and an insertion or permutation event relies upon the detection of all four alignments. If alignments 3 and 4 score below a given threshold, then postprocessing will erroneously report a circular permutation. A possible strategy for avoiding such false positive permutations or insertions is to mask tandem repeats in a preprocessing step using a method such as that of Heringa and Argos [69].
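The four-alignment example can be made concrete with a short Python sketch. It assumes each alignment is summarised as a tuple of the query half ('N' or 'C'), the repeat number in the two-repeat protein, and a score; the scores and thresholds below are invented for illustration, and `interpret` is a hypothetical helper rather than part of any published method.

```python
def interpret(alignments, threshold):
    """Interpret alignments of a single-domain query against a two-repeat protein.

    alignments: iterable of (query_half, repeat, score), with query_half in
    {'N', 'C'} and repeat in {1, 2}. Only scores >= threshold are kept.
    """
    kept = {(half, rep) for half, rep, score in alignments if score >= threshold}
    if kept == {("N", 1), ("C", 1), ("N", 2), ("C", 2)}:
        return "tandem repeats"
    if kept == {("C", 1), ("N", 2)}:
        # C-terminal half of the query precedes its N-terminal half.
        return "apparent circular permutation"
    if kept == {("N", 1), ("C", 2)}:
        # Query halves are in order but separated by intervening sequence.
        return "apparent insertion"
    return "inconclusive"


# The four alignments from the text, ranked by (hypothetical) score.
ranked = [("C", 1, 62.0), ("N", 2, 58.0), ("N", 1, 41.0), ("C", 2, 37.0)]

print(interpret(ranked, threshold=35.0))  # tandem repeats
print(interpret(ranked, threshold=50.0))  # apparent circular permutation
```

With the lenient threshold all four alignments survive and the repeats are recognised; with the stricter threshold alignments 3 and 4 are lost and the same data are misread as a permutation, which is the false positive described above.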
Conclusions
Nature is clearly capable of generating surprising fold irregularities. The systematic detection of these irregularities within sequence databases, particularly when sequence divergence is substantial, requires some modifications to current sequence homology searching protocols. Analysis of these fold irregularities will not only assist in the detection of new members of diverse sequence families, but will improve our understanding of protein structure and evolution.
Acknowledgements I'~BR thanks Chris Rawlings, [)a~id Scads and Ford Calhoun (SmirhKlinc P,cccham) for their support alld CllCOllra~cmcnL (:PP is a \Vcllcolnc "['rust
369
Carccr Dcxclopmcnt [:cllm~ and a mcmbcr of the ()xfl~rd Ccntrc fl~r Molecular Sciences. f h c authors thank Alex Whittaker (hnpcria] Canccr Research Fund, I.ondon) for assistance during the search for new examples of" pcrmutariens and arc gratcfld to Daxid Scads. Rich Coplcx (%mithKlinc F;cccham) and ]';x~anF~irnc~,(Sangcr Centre) fi)r helpful discussions.
References and recommended reading
Papers of particular interest, published within the annual period of review, have been highlighted as:
• of special interest
•• of outstanding interest

1. Doolittle RF: The multiplicity of domains in proteins. Annu Rev Biochem 1995, 64:287-314.

2. Bork P, Gibson TJ: Applying motif and profile searches. Methods Enzymol 1996, 266:162-184.

3. Heringa J, Taylor WR: Three-dimensional domain duplication, swapping and stealing. Curr Opin Struct Biol 1997, 7:416-421.

4. Russell RB: Domain insertion. Protein Eng 1994, 7:1407-1410.

5. Lindqvist Y, Schneider G: Circular permutation of natural protein sequences: structural evidence. Curr Opin Struct Biol 1997, 7:422-427.

6. Fletterick RJ, Bazan JF: When one and one are not two. Nat Struct Biol 1995, 2:721-723.

7. •• Zhang G, Liu Y, Ruoho AE, Hurley JH: Structure of the adenylyl cyclase catalytic core. Nature 1997, 386:247-253.
See annotation by Artymiuk et al. 1997 [8••].

8. •• Artymiuk PJ, Poirrette AR, Rice DW, Willet P: A polymerase I palm in adenylyl cyclase? Nature 1997, 388:33-34.
The authors describe a structural similarity between adenylyl cyclase [7••] and the DNA polymerase palm domain. There is little sequence similarity between the two proteins, although a similar reaction mechanism and key residue conservation suggest that the cyclase contains only the palm domain (i.e. the catalytic centre) of polymerases. Additions to the fold seen in polymerases serve mostly to regulate binding and specificity; different domain additions (not insertions) to cyclases (such as a second catalytic domain and transmembrane regions) modify function and cellular location.

9. • Doublié S, Tabor S, Long AM, Richardson CC, Ellenberger T: Crystal structure of a bacteriophage T7 DNA replication complex at 2.2 Å resolution. Nature 1997, 391:251-258.
The relative orientations of domains in DNA polymerases appear to be crucial to their function. Bacteriophage T7 DNA polymerase contains a 71 residue thioredoxin-binding domain, inserted into the tip of the thumb, that mediates a thioredoxin-dependent increase in processive DNA synthesis.

10. • Åberg A, Yaremchuk A, Tukalo M, Rasmussen B, Cusack S: Crystal structure analysis of the activation of histidine by Thermus thermophilus histidyl-tRNA synthetase. Biochemistry 1997, 36:3084-3094.
In common with other class II synthetases, T. thermophilus histidyl-tRNA synthetase contains a domain insert that might assist in tRNA binding. The corresponding domain in the previously determined structure of E. coli histidyl-tRNA synthetase was completely disordered.

11. • Kostrewa D, Grüninger-Leitch F, D'Arcy A, Broger C, Mitchell D, van Loon APGM: Crystal structure of phytase from Aspergillus ficuum at 2.5 Å resolution. Nat Struct Biol 1997, 4:185-190.
Phytase is similar in both sequence and structure to rat acid phosphatase (PDB accession code 1RPA). The inserted domain in phytase, which is composed of several excursions from the α/β domain, contributes to an active site that is, however, larger than that of rat acid phosphatase as a result of its preference for larger substrates.

12. • Möhrle JJ, Zhao Y, Wernli B, Franklin RM, Kappes B: Molecular cloning, characterization and localization of PfPK4, an eIF-2α kinase-related enzyme from the malarial parasite Plasmodium falciparum. Biochem J 1997, 328:677-687.
The protozoan protein kinase PfPK4 is closely related to mammalian eukaryotic initiation factor 2α (eIF-2α) kinases, both of which contain two inserts. The first of these is usually extended for mammalian eIF-2α kinases, and is extremely long for PfPK4. No domain homologues have been detected within these inserts.

13. Essen L-O, Perisic O, Katan M, Williams RL: Crystal structure of a mammalian phosphoinositide-specific phospholipase Cδ. Nature 1996, 380:595-602.

14. Musacchio A, Gibson T, Rice P, Thompson J, Saraste M: The PH domain: a common piece in the structural patchwork of signalling proteins. Trends Biochem Sci 1993, 18:343-348.
15. Ponting CP, Phillips C: DHR domains in syntrophins, neuronal NO synthases and other intracellular proteins. Trends Biochem Sci 1995, 20:102-103.
16. Leung T, Chen XQ, Manser E, Lim L: The p160 RhoA-binding kinase ROK alpha is a member of the kinase family and is involved in the reorganization of the cytoskeleton. Mol Cell Biol 1996, 16:5313-5327.
17. • Zheng J, Chen R-H, Corbalan-Garcia S, Cahill SM, Bar-Sagi D, Cowburn D: The solution structure of the pleckstrin homology domain of human SOS1. A possible structural role for the sequential association of diffuse B cell lymphoma and pleckstrin homology domains. J Biol Chem 1997, 272:30340-30344.
Dbl-homology (DH) domains that are contained in several proto-oncogene products are invariably followed in sequence by a pleckstrin homology (PH) domain whose probable role is to recruit the protein to the cell membrane. The attempts by Cowburn and co-workers to express in E. coli a recombinant PH domain from the DH-containing SOS1 protein using commonly used domain limits from sequence analysis failed due to protein insolubility. Extending the domain at its N terminus, however, resulted in a soluble product whose structure shows a well-defined N-terminal α helix packed against an interstrand loop that is predicted to modulate the binding of inositol 1,4,5-trisphosphate.
18. •• Hyvönen M, Saraste M: Structure of the PH domain and Btk motif from Bruton's tyrosine kinase: molecular explanations for X-linked agammaglobulinaemia. EMBO J 1997, 16:3396-3404.
A large number of mutations in the gene for Bruton's tyrosine kinase (Btk) have been found in patients suffering from X-linked agammaglobulinaemia. One of these mutations results in the substitution of a conserved cysteine that coordinates zinc binding in a C-terminal extension of the PH domains of Btk-like molecules. The fold of the Btk extension appears to require the presence of the associated PH domain since it packs closely against it in the crystal structure.
19. • Ponting CP, Cai Y-D, Bork P: Breast cancer gene product TSG101: a regulator of ubiquitination? J Mol Med 1997, 75:467-469.
See annotation to Koonin and Abagyan 1997 [20•].
20. • Koonin EV, Abagyan RA: TSG101 may be the prototype of a class of dominant negative ubiquitin regulators. Nat Genet 1997, 16:330-331.
TSG101 has been identified as a homologue of ubiquitin-conjugating enzymes. It appears to differ from the known structures of ubiquitin-conjugating enzymes, however, lacking their active-site cysteines and being truncated at its C terminus.
21. Musco G, Stier G, Joseph C, Morelli MAC, Nilges M, Gibson TJ, Pastore A: Three-dimensional structure and stability of the KH domain: molecular insights into the fragile X syndrome. Cell 1996, 85:237-245.
22. Cunningham BA, Hemperly JJ, Hopp TP, Edelman GM: Favin versus concanavalin A: circularly permuted amino acid sequences. Proc Natl Acad Sci USA 1979, 76:3218-3222.
23. Yamauchi D, Minamikawa T: Structure of the gene encoding concanavalin A from C. gladiata and its expression in E. coli cells. FEBS Lett 1990, 260:127-130.
24. Goldenberg DP: Circularly permuted proteins. Protein Eng 1989, 2:493-495.
25. Pan T, Uhlenbeck OC: Circularly permuted DNA, RNA and proteins - a review. Gene 1993, 125:111-114.
26. Ponting CP, Russell RB: Swaposins: circular permutations within genes encoding saposin homologues. Trends Biochem Sci 1995, 20:179-180.
27. • Liepinsh E, Andersson M, Ruysschaert JM, Otting G: Saposin fold revealed by the NMR structure of NK-lysin. Nat Struct Biol 1997, 4:793-795.
The first structure determination of a saposin homologue. The novel fold shows how the previously reported sequence permutation [26] can be accommodated by the tertiary structure of saposins. It appears possible to relocate the N and C termini with minimal disruption to the overall structure.
28. Guruprasad K, Tormakangas K, Kervinen J, Blundell TL: Comparative modelling of barley-grain aspartic proteinase: a structural rationale for observed hydrolytic specificity. FEBS Lett 1994, 352:131-136.
29. • Flint HJ, Whitehead TR, Martin JC, Gasparic A: Interrupted catalytic domain structure in xylanases from two distantly related strains of Prevotella ruminicola. Biochim Biophys Acta 1997, 1337:161-165.
A report of the structures of xylanases from two strains of an anaerobic bacterium, in which the catalytic domain of one is interrupted by a domain that is distantly related to cellulose-binding domains, although permuted.
30. •• Murzin AG: Probable circular permutation in the flavin-binding domain. Nat Struct Biol 1998, 5:101.
The recently solved structure of the FMN-binding protein from Desulfovibrio vulgaris [31] was proposed to share an ancient common ancestor with domains from trypsin-like serine proteinases. Structural alignment after permutation, however, reveals a closer relationship with the ferredoxin reductase superfamily of enzymes. Seven β strands forming the core barrel structure, an α helix, and an equivalent FAD-FMN binding site were found following circular permutation and structure superimposition.
31. Liepinsh E, Kitamura M, Murakami T, Nakaya T, Otting G: Pathway of chymotrypsin evolution suggested by the structure of the FMN-binding protein from Desulfovibrio vulgaris. Nat Struct Biol 1997, 4:975-979.
32. Johnson PE, Joshi MD, Tomme P, Kilburn DG, McIntosh LP: Structure of the N-terminal cellulose-binding domain of Cellulomonas fimi CenC determined by nuclear magnetic resonance spectroscopy. Biochemistry 1996, 36:14381-14394.
33. Islam SA, Luo J, Sternberg MJE: Identification and analysis of domains in proteins. Protein Eng 1995, 8:523-525.
34. Greenblatt HM, Ryan CA, James MNG: Structure of the complex of Streptomyces griseus proteinase B and polypeptide chymotrypsin inhibitor-1 from Russet Burbank potato tubers at 2.1 Å resolution. J Mol Biol 1989, 205:201-228.
35. Taylor BH, Young RJ, Scheuring CF: Induction of a proteinase inhibitor II - a class of gene by auxin in tomato roots. Plant Mol Biol 1993, 23:1005-1014.
36. Wall MA, Coleman DE, Lee E, Iñiguez-Lluhi JA, Posner BA, Gilman AG, Sprang SR: The structure of the G protein heterotrimer Giα1β1γ2. Cell 1995, 83:1047-1058.
37. Bennett MJ, Schlunegger MP, Eisenberg D: 3D domain swapping: a mechanism for oligomer assembly. Protein Sci 1995, 4:2455-2468.
38. Murray AJ, Lewis SJ, Barclay AN, Brady RL: One sequence, two folds: a metastable structure of CD2. Proc Natl Acad Sci USA 1995, 92:7337-7341.
39. Piccoli R, Tamburrini M, Piccialli G, Di Donato A, Parente A, D'Alessio G: The dual-mode quaternary structure of seminal RNase. Proc Natl Acad Sci USA 1992, 89:1870-1874.
40. McGrath ME, Gillmor SA, Fletterick RJ: Ecotin: lessons on survival in a protease-filled world. Protein Sci 1995, 4:141-148.
41. Parge HE, Arvai AS, Murtari DJ, Reed SI, Tainer JA: Human CksHs2 atomic structure: a role for its hexameric assembly in cell cycle control. Science 1993, 262:387-395.
42. Green SM, Gittis AG, Meeker AK, Lattman EE: One-step evolution of a dimer from a monomeric protein. Nat Struct Biol 1995, 2:746-751.
43. Ealick SE, Cook WJ, Vijay-Kumar S, Carson M, Nagabhushan TL, Trotta PP, Bugg CE: Three-dimensional structure of recombinant human interferon-γ. Science 1991, 252:698-702.
44. Milburn MV, Hassell AM, Lambert MH, Jordan SR, Proudfoot AEI, Graber P, Wells TNC: A novel dimer configuration revealed by the crystal structure at 2.4 Å resolution of human interleukin-5. Nature 1993, 363:172-176.
45. Zdanov A, Schalk-Hihi C, Gustchina A, Tsang M, Weatherbee J, Wlodawer A: Crystal structure of interleukin-10 reveals the functional dimer with an unexpected topological similarity to interferon γ. Structure 1995, 3:591-601.
46. •• Kishan KVR, Scita G, Wong WT, Di Fiore PP, Newcomer ME: The SH3 domain of Eps8 exists as a novel intertwined dimer. Nat Struct Biol 1997, 4:739-743.
The crystal structure of the Eps8 SH3 domain dimer shows that it is formed by β-strand exchange. The authors show using co-immunoprecipitation experiments that full length Eps8 is a dimer in vivo and that dimerisation is dependent on an intact SH3 domain. Since the SH3 domain polyproline-binding site is partially occluded in the dimer, the possibility remains that the regulation of Eps8 function may occur via SH3-mediated reversible dimerisation. Tandem domains cannot always be considered to represent independent structural domains.
47. Tegoni M, Ramoni R, Bignetti E, Spinelli S, Cambillau C: Domain swapping creates a third putative combining site in bovine odorant binding protein dimer. Nat Struct Biol 1996, 3:863-867.
48. Bianchet MA, Bains G, Pelosi P, Pevsner J, Snyder SH, Monaco HL, Amzel LM: The three-dimensional structure of bovine odorant binding protein and its mechanism of odor recognition. Nat Struct Biol 1996, 3:934-939.
49. Mottonen J, Strand A, Symersky J, Sweet RM, Danley DE, Geoghegan KF, Gerard RD, Goldsmith EJ: Structural basis of latency in plasminogen activator inhibitor-1. Nature 1992, 355:270-273.
50. Cooke RM, Carter BG, Murray-Rust P, Hartshorn MJ, Herzyk P, Hubbard RE: The solution structure of echistatin: evidence for disulphide bond rearrangement in homologous snake toxins. Protein Eng 1992, 5:473-477.
51. Zimmermann GR, Legault P, Selsted ME, Pardi A: Solution structure of bovine neutrophil β-defensin-12: the peptide fold of the β-defensins is identical to that of the classical defensins. Biochemistry 1995, 34:13663-13671.
52. •• Benitez BAS, Hunter MJ, Meininger DP, Komives EA: Structure of the fifth EGF-like domain of thrombomodulin: an EGF-like domain with a novel disulfide-bonding pattern. J Mol Biol 1997, 273:913-926.
The NMR structure of thrombomodulin epidermal growth factor-like domain 5 (TMEGF5) shows a different disulphide-bonding pattern from that of EGF itself. In addition, a two-stranded β sheet that is common to most EGF homologues is lacking in TMEGF5. The TMEGF5 fragment with this novel disulphide-bonding pattern retains the thrombin-binding function of intact thrombomodulin.
53. Perler FB, Davis EO, Dean GE, Gimble FS, Jack WE, Neff N, Noren CJ, Thorner J, Belfort M: Protein splicing elements: inteins and exteins - a definition of terms and recommended nomenclature. Nucleic Acids Res 1994, 22:1125-1127.
54. Perler FB: Protein splicing of inteins and hedgehog autoproteolysis: structure, function, and evolution. Cell 1998, 92:1-4.
An excellent review discussing the recent advances in the understanding of the structure, function and evolution of inteins.
55. • Duan X, Gimble F, Quiocho F: Crystal structure of PI-SceI, a homing endonuclease with protein splicing activity. Cell 1997, 89:555-564.
The first view of the folds of the two domains commonly contained in inteins. Domain II is a core endonuclease that is inserted into domain I, whose fold is similar to that of the Hedgehog autoprocessing domain. Domain I also accommodates an insertion that mediates DNA recognition.
56. •• Klabunde T, Sharma S, Telenti A, Jacobs WR Jr, Sacchettini JC: Crystal structure of GyrA intein from Mycobacterium xenopi reveals structural basis of protein splicing. Nat Struct Biol 1998, 5:31-36.
The GyrA intein differs from most inteins in lacking both the endonuclease domain and the DNA recognition region. Its structure and function are strikingly similar to those of the Hedgehog autoprocessing domain.
57. •• Tanaka-Hall TM, Porter JA, Young KE, Koonin EV, Beachy PA, Leahy DJ: Crystal structure of a Hedgehog autoprocessing domain: homology between Hedgehog and self-splicing proteins. Cell 1997, 91:85-97.
The first reported three-dimensional structure of an autoprocessing domain that is homologous to domain I of inteins. The structure provides insights into the mechanism of these autocatalytic modules and, together with a wider family of homologues, provides examples of domain duplication, insertion, permutation and secondary structure exchange.
58. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol 1990, 215:403-410.
59. Smith TF, Waterman MS: Identification of common molecular subsequences. J Mol Biol 1981, 147:195-197.
60. Zuker M: Suboptimal sequence alignment in molecular biology: alignment with error analysis. J Mol Biol 1991, 221:403-420.
61. Barton GJ: An efficient algorithm to locate all locally optimal alignments between two sequences allowing for gaps. Comput Appl Biosci 1993, 9:729-734.
62. Birney E, Thompson JD, Gibson TJ: PairWise and SearchWise: finding the optimal alignment in a simultaneous comparison of a protein profile against all DNA translation frames. Nucleic Acids Res 1996, 24:2730-2739.
63. Pearson WR: Searching protein sequence libraries: comparison of the sensitivity and selectivity of the Smith-Waterman and FASTA algorithms. Genomics 1991, 11:635-650.
64. Tatusov RL, Altschul SF, Koonin EV: Detection of conserved segments in proteins: iterative scanning of sequence databases with alignment blocks. Proc Natl Acad Sci USA 1994, 91:12091-12095.
65. Neuwald AF, Liu JS, Lipman DJ, Lawrence CE: Extracting protein alignment models from the sequence database. Nucleic Acids Res 1997, 25:1665-1677.
66. •• Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25:3389-3402.
Modifications to BLAST have made the method more powerful. Gapped alignment now allows more continuous domains to be identified, and iterative database searches, fine-tuned by the construction of profiles that are specific to a protein family, make the method more sensitive. The central assumptions of short gaps and contiguous segments mean, however, that permutations and insertions are still difficult to detect without postprocessing.
67. Searls DB, Murphy KP: Automata-theoretic models of mutation and alignment. In Proceedings of the Third International Conference on Intelligent Systems for Molecular Biology. Edited by Rawlings C, Clark D, Altman R, Hunter L, Lengauer T, Wodak S. Menlo Park: AAAI Press; 1995:341-349.
68. • Birney E, Durbin R: Dynamite: a flexible code generating language for dynamic programming methods used in sequence comparison. ISMB 1997, 7:56-64.
This paper presents a method for generating dynamic programming code that is specific to an application, given a fairly simple textual description of the problem that needs to be solved. It may not be able to detect insertions or permutations directly, although the possibility certainly exists of constructing complex grammars specific to the task.
69. Heringa J, Argos P: A method to recognize distant repeats in protein sequences. Proteins 1993, 17:391-411.
70. Kraulis PJ: MOLSCRIPT: a program to produce detailed and schematic plots of protein structures. J Appl Crystallogr 1991, 24:950-964.