J. Mol. Biol. (1992) 228, 170-187
Evaluation of the Sequence Template Method for Protein Structure Prediction: Discrimination of the (β/α)8-Barrel Fold
Stephen D. Pickett, Mansoor A. S. Saqi and Michael J. E. Sternberg

Biomolecular Modelling Laboratory, Imperial Cancer Research Fund, 44 Lincoln's Inn Fields, London WC2A 3PX, U.K.
(Received 2 January 1992; accepted 3 July 1992)
A multiple alignment of five (β/α)8-barrel enzymes has been derived from their structure. The eight β-strands and eight α-helices of the (β/α)8-barrel are correctly aligned and the equivalenced residues in these regions fulfil similar structural roles. Each β-strand has a central core of usually four residues; two residues contribute side-chains to the barrel core and the other two residues are involved in β-strand/α-helix contacts. However, the fold imposes no constraints on the volumes of the residues at either a local or global level: the volume of the β-barrel core varies between 1088 Å³ in glycolate oxidase and 1571 Å³ in taka-amylase. Sequence motifs derived from the multiple alignment were scanned against a database of 124 protein sequences, including 17 (β/α)8-barrel enzymes. The results were evaluated in terms of the discrimination of (β/α)8-barrel sequences and the quality of the alignments obtained. One motif was able to identify the top 15% of high scoring sequences as forming (β/α)8-barrels with 50% accuracy and the bottom 50% of sequences as not being (β/α)8-barrel proteins with 100% accuracy. However, in most instances the alignments were poor. The reasons for this are discussed with reference to the (β/α)8-barrel proteins and the sequence motif method in general.

Keywords: protein folding; enzyme structure; TIM-barrel; structure prediction; sequence motifs
1. Introduction

The structure of a protein is encoded within its amino acid sequence (Anfinsen et al., 1961). However, the general problem of predicting protein structure from sequence remains unsolved and consequently alternative, partial solutions are being explored. Known protein structures contain a wealth of information which may be utilized to detect proteins which adopt a similar tertiary fold. This task, of recognizing which known three-dimensional structure (if any) a sequence encodes, is often referred to as the inverse protein folding problem (Bowie et al., 1991) and a number of methods have been developed to tackle it. The (β/α)8-barrel family of proteins, epitomized by triose phosphate isomerase (TIM†; Banner et al., 1976), provides a significant challenge to these methods (Thornton et al., 1991) as the fold, adopted by a number of known enzyme structures (Farber & Petsko, 1990), is tolerant of a wide range of amino acid sequences. In this paper, the problem of identifying proteins which adopt the (β/α)8-barrel fold is addressed.

Farber & Petsko (1990) detailed the 17 proteins then known to adopt the (β/α)8-barrel topology. A barrel is composed of eight parallel β-strands surrounded by eight α-helices, the connectivity being βαβα, etc. All are enzymes with the catalytic site at the carboxy-terminal end of the β-barrel, though there are structural variations from the standard (β/α)8 topology. For example, in enolase (Stec & Lebioda, 1990) the first two β-strands are antiparallel and the connectivity is ββαα(βα)6. Muconate lactonizing enzyme (MLE; Goldman et al., 1987) and mandelate racemase (Neidhart et al., 1991) have the eighth helix replaced by a separate domain. In many proteins there are additional α-helices or domains located in the β/α or α/β loops. The structures of two other enzymes with the (β/α)8-barrel fold have recently been reported, adenosine deaminase (Wilson et al., 1991) and aldose reductase (Rondeau et al., 1992). Cellobiohydrolase II (Rouvinen et al., 1990) is similar but has just seven β-strands in the barrel, the first six being connected by α-helices.

The similarities in structure have prompted a debate as to how these enzymes evolved. Farber & Petsko (1990) grouped the enzymes into four different families based on structural criteria such as lengths of secondary structure elements and the location of additional α-helices or domains. They suggested that cyclic permutations provided a mechanism by which a single ancestral enzyme could have diverged into these four families. Wilmanns et al. (1991) have presented support for divergent evolution within a subset of (β/α)8-barrel enzymes from an analysis of two (β/α)8-barrel enzymes in the tryptophan biosynthesis pathway, the α-subunit of tryptophan synthase (WSYA) and the two (β/α)8-barrel domains of the bienzyme GWEC, (phosphoribosyl)-anthranilate isomerase:indole glycerolphosphate synthase (PRAI:IGPS). On the other hand, based on an analysis of side-chain packing, Lesk et al. (1989) have suggested convergent evolution to a stable (β/α)8-barrel fold.

† Abbreviations used: TIM, triose phosphate isomerase; MLE, muconate lactonizing enzyme; WSYA, α-subunit of tryptophan synthase; GWEC, the bienzyme IGPS:PRAI; IGPS, indole glycerolphosphate synthase; PRAI, (phosphoribosyl)-anthranilate isomerase; RUB, ribulose-1,5-bisphosphate carboxylase (rubisco); GOX, glycolate oxidase; ADPSGP, 2-keto-3-deoxy-6-phosphogluconate aldolase; PYK, pyruvate kinase; 3D, three-dimensional; XIA, D-xylose isomerase; TAA, taka-amylase; PDB, Protein Data Bank; r.m.s., root-mean-square; c.p.u., central processor unit; PIR, protein identification resource; MANR, mandelate racemase; CDGT, cyclodextrin glycosyltransferase; ADRBA, fructose bisphosphate aldolase.
Efforts to explain the evolutionary origins and structural principles determining the (β/α)8-barrel fold have prompted several analyses of this family of proteins. Lesk et al. (1989) have examined the packing of β-strand residues into the barrel core of three (β/α)8-barrel proteins, TIM, rubisco (RUB; Schneider et al., 1990) and glycolate oxidase (GOX; Lindqvist, 1989). The residues form three layers, each layer consisting of the side-chain from a residue on four alternate β-strands. Two distinct types of packing were found. In the first type, found in TIM and RUB, the odd-numbered β-strands of the barrel contribute one side-chain each to a middle layer. This packed layer is sandwiched by an upper and lower layer formed by two side-chains from each even-numbered β-strand. For GOX the situation is reversed and the even-numbered β-strands contribute one side-chain to the middle layer. If glycine is the residue expected to contribute a side-chain then the gap is filled by side-chains either from a layer above or below or by other residues in the β-strands; this may lead to distortion of the barrel.
Lebioda et al. (1982) have compared the structures of 2-keto-3-deoxy-6-phosphogluconate aldolase (ADPSGP; Mavridis et al., 1982), TIM and domain A of pyruvate kinase (PYK; Stuart et al., 1979). The sequence of the latter was unknown at this time. They used the method of Rossmann & Argos (1975) to superimpose each pair of structures and locate equivalent positions. They also examined different orientations of the barrels with respect to each other and found no preference for the β-strand 1 to β-strand 1 superimposition over alternate orientations.

Attention has also focused on the ideal geometry of barrels formed from β-sheets. McLachlan (1979) was the first to address this problem and his model has been shown to account quantitatively for the β-barrels in TIM, RUB and GOX (Lesk et al., 1989). Lasters et al. (1988) modelled the barrels of nine enzymes by fitting hyperboloids to the β-barrels. They found that while the axial ratio of the cross-section of the barrels varied between 1.0 and 1.48 the mean cross-sectional area was 163 Å² (1 Å = 0.1 nm) with a standard deviation of only ten percent. This model has been used subsequently as a framework for the design of idealized (β/α)8-barrels (Lasters et al., 1990).

The βαβα... connectivity of the barrels has led some workers to investigate the possibility that the structures may contain α-loop-β or β-loop-α units found in other α/β proteins. Rice et al. (1990) exhaustively studied units from six (β/α)8-barrels and seven other α/β-proteins by semi-automatic least-squares fitting, using the β-strand and α-helical residues as a guide to the initial equivalences. They found an α-helix-turn-β-strand motif common to MLE, TIM, ADPSGP and enolase and two other α/β-proteins, glutathione reductase and liver alcohol dehydrogenase. There was no evidence of similarity between β/α units. Scheerlinck et al. (1992) have found a preference for loops belonging to the αβ1 and αβ3 loop families found in other α/β-proteins.
The αβ3 loops invariably involve even-numbered β-strands, while αβ1 loops involve preferentially odd β-strands.

A major aim of the work presented below, that has not been tackled previously, was to address the question of whether there are key residues that determine the (β/α)8-barrel fold, permitting the identification of a sequence motif capable of detecting proteins adopting the (β/α)8-barrel topology, the inverse protein folding problem. If a sequence shows strong sequence similarity with a protein of known structure then model building techniques may be applied (e.g. Browne et al., 1969; Blundell et al., 1986, 1987). However, when the similarity is not strong, as in the case of the (β/α)8-barrel proteins, alternative methods must be used. Several methods for tackling the problem have been reviewed by Taylor (1988). Key features of a structure, or group of structures, are collected into a template which is then scanned against a query sequence. The inference is made that sequences
Table 1
The concept of a flexible pattern introduced by Barton & Sternberg (1990) permits the use of residue patterns derived from regions considered to be important within a protein family, separated by gaps of defined range. Thus, regions known to vary within a protein family are discarded, increasing the discriminatory power of the patterns. The method has been tested on the globin fold and was able to discriminate globins from non-globins within the Protein Identification Resource database.

Recently, Bowie et al. (1991) have approached the problem by finding sequences most compatible with the environments of residues in a particular three-dimensional structure. The environment of a residue is described by the area buried, the fraction of the side-chain covered by polar atoms and the local secondary structure. Thus, the primary sequence is related to the three-dimensional (3D) structure in a natural way. This 3D profile method was tested successfully on several protein families including the globins, cyclic adenosine 3',5'-monophosphate receptor-like proteins, the periplasmic binding proteins and the actins.

In this paper, the identification of possible sequence motifs for the (β/α)8-barrel enzymes has been approached by generating a multiple structural alignment of five non-homologous (β/α)8-barrel proteins, TIM, GOX, RUB, D-xylose isomerase (XIA; Henrick et al., 1989) and taka-amylase (TAA; Matsuura et al., 1984), using a development of the method of Taylor & Orengo (1989a,b). A variety of templates are derived from this alignment for scanning against a sequence database using the methods of Barton & Sternberg (1990) and Bowie et al. (1991). The success of these methods is assessed in terms both of the rank of (β/α)8-barrel enzymes in the scans and the quality of the alignments generated with the templates. The implications with respect to the (β/α)8-barrel enzymes and for sequence motif methods are discussed.
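The flexible pattern idea described above, conserved blocks separated by gaps of a defined range, can be illustrated with a small dynamic-programming scan. This is only a sketch under stated assumptions: the function name `scan`, the representation of a block as a list of allowed-residue sets, and the scoring by simple match counting are all hypothetical simplifications for illustration and are not the scoring scheme of Barton & Sternberg (1990).

```python
NEG = float("-inf")

def scan(sequence, pattern):
    """Best score for placing every pattern element on the sequence.

    pattern is a list of (block, min_gap, max_gap) tuples (hypothetical format).
    block is a list of sets of allowed residues, one set per position.
    min_gap/max_gap bound the number of residues between the end of the
    previous block and the start of this one; they are ignored for the
    first element, which may start anywhere.
    """
    n = len(sequence)

    def block_score(block, start):
        # Count positions where the sequence residue is allowed by the block.
        if start + len(block) > n:
            return NEG
        return sum(1 for allowed, res in zip(block, sequence[start:]) if res in allowed)

    first_block = pattern[0][0]
    best = [block_score(first_block, p) for p in range(n)]
    prev_len = len(first_block)

    for block, gmin, gmax in pattern[1:]:
        new = [NEG] * n
        for p in range(n):
            s = block_score(block, p)
            if s == NEG:
                continue
            # Previous block starts at q and ends at q + prev_len,
            # so the gap is p - (q + prev_len) and must lie in [gmin, gmax].
            lo = max(0, p - prev_len - gmax)
            hi = p - prev_len - gmin
            candidates = [best[q] for q in range(lo, hi + 1)]
            if candidates and max(candidates) > NEG:
                new[p] = s + max(candidates)
        best, prev_len = new, len(block)

    return max(best)


# Toy example with an invented sequence and invented blocks.
seq = "AAGDKTTTTTSVL"
loose = [([{"G"}, {"D"}, {"K"}], 0, 0), ([{"S"}, {"V", "I"}, {"L"}], 3, 6)]
tight = [([{"G"}, {"D"}, {"K"}], 0, 0), ([{"S"}, {"V", "I"}, {"L"}], 0, 2)]
print(scan(seq, loose), scan(seq, tight))
```

In the toy example both blocks match perfectly when the permitted gap range includes the five residues that separate them, whereas the tighter gap range forces one block onto a non-matching region and the score drops. Restricting the gap range is what lets a pattern reject sequences that contain the right blocks at the wrong spacing.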
2. Methods

(a) Structural data

Five (β/α)8-barrel proteins have been studied in detail: TIM (Banner et al., 1976), GOX (Lindqvist, 1989), RUB (Schneider et al., 1990), XIA (Henrick et al., 1989) and TAA (Matsuura et al., 1984). The co-ordinate data sets for TIM, GOX, TAA and XIA were available from the Brookhaven Protein Structure Data Bank (PDB; Bernstein et al., 1977) identified by the following codes, 1TIM, 1GOX, 2TAA and 5XIA. The full set of co-ordinates for RUB were obtained from Dr Y. Lindqvist and are available as data bank entry 5RUB. Secondary structure definitions were taken from the PDB file helix and sheet records. The resolution of the X-ray structures and the regions of the structures used in the multiple alignment (see section 3, below) are listed in Table 1.

(β/α)8-barrel proteins have a wide disparity in amino acid sequence and are not generally amenable to alignment by standard techniques employing, for example, a Dayhoff matrix. Early approaches to the problem of aligning proteins by their structure as opposed to sequence involved the least-squares fitting of 2 structures to locate topologically equivalent residues (Rao & Rossmann, 1973; Rossmann & Argos, 1975, 1976, 1977). Alternatively, smaller contiguous regions of 1 protein may be fitted by least-squares to the whole length of the 2nd protein (Remington & Matthews, 1978, 1980). These methods have been reviewed by Matthews & Rossmann (1985) and have been used by several workers to generate pairwise superimpositions of 3 (β/α)8-barrel structures (Lebioda et al., 1982; Wilmanns et al., 1991). Recently, Sali & Blundell (1990) have developed a method incorporating dynamic programming and simulated annealing to produce multiple alignments based on such factors as amino acid identities, solvent accessibility, hydrogen bonding and nearest neighbour distances.
In this work, the method of Taylor & Orengo (1989a,b) is used to generate a structural alignment for each pair of the 5 structures studied. We have extended the method to produce a multiple alignment of the structures based on these pairwise comparisons. In brief, interatomic vectors are generated between all Cβ atoms within a local co-ordinate frame (dummy Cβ atom positions being calculated for glycine residues). For each residue i in structure a and residue k in structure b the Cβ-Cβ vectors v_ij and v_kl to all residues j and l, respectively, are compared using the scoring scheme in eqn (1), where s is the score and A and B are constants:

$$ s = \frac{A}{(\mathbf{v}_{ij} - \mathbf{v}_{kl})^{2} + B} \qquad (1) $$
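As a minimal sketch of how eqn (1) can be evaluated, the fragment below scores one pair of interatomic vectors and fills a matrix of such scores for one residue pair (i, k). The function names are hypothetical, the vectors are assumed to be plain 3-tuples already expressed in each residue's local co-ordinate frame, and the defaults A = 50 and B = 2 are the values the text later attributes to Taylor & Orengo (1989a,b).

```python
def pair_score(v_ij, v_kl, a=50.0, b=2.0):
    """Eqn (1): s = A / (|v_ij - v_kl|^2 + B)."""
    d2 = sum((p - q) ** 2 for p, q in zip(v_ij, v_kl))
    return a / (d2 + b)


def score_matrix(vectors_i, vectors_k, a=50.0, b=2.0):
    """Scores for one residue pair (i, k).

    vectors_i: vectors from residue i to every residue j in structure a.
    vectors_k: vectors from residue k to every residue l in structure b.
    Entry [j][l] compares v_ij with v_kl.
    """
    return [[pair_score(v, w, a, b) for w in vectors_k] for v in vectors_i]


# Identical vectors give the maximum score A / B; the score falls off
# with the squared length of the difference vector.
print(pair_score((0.0, 0.0, 0.0), (0.0, 0.0, 0.0)))
print(pair_score((3.0, 4.0, 0.0), (0.0, 0.0, 0.0)))
```

The constant B keeps the score finite when two vectors coincide, so the largest possible score is A/B (25 with the default values).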
Each pair of residues i and k has an associated matrix, termed the distance-level matrix, into which these scores are accumulated. A dynamic programming algorithm is used to find the best-path through each of these matrices and the scores from the best-path are accumulated in a residue-level matrix, specific to the 2 structures being aligned. Finally, the structural alignment of structures a and b is found by tracing the best-path through this residue-level matrix. The incorporation of a window to increase the speed of the procedure means that the total
(/?/a)*-Barrels Table 2 List of proteins in the sequence database 2TAA 1TIMA SXIAA 2ENL MANR ADPSGP CDGT PYK ADRBA 1FCB 1MLE 2RUB 1GOX IGPS PRAI GWEC 1WSYA 1\c’SY 11 I.AliI’ II3.tlV2 l(‘JIS I ( “l-s 11)I’l 1F I 9 I, I F 19H I I”(‘1 ‘4 I F(‘J.21) IHFI I Hyl(;X IHSFF II(:EA I M(‘EV\V 1M( ‘\VJI 1ST1 I PK( ‘(‘ I PRC’I, I I’R( ‘21 1t’R( ‘H 1PT E 1R 1.AI 1RI AL’ I R I XI IRIEE IRHI) ISR’ 1SW1 1St :‘I l’l’F(‘IJ > 1 I’R’I r , I’lY:N I’I’HI I ‘I’ f’A E 2.i.i’l ?A( ‘T 2:\ PI “A PR i?13LX 3 -“(‘,A 13 ;I( 31,A “( ‘SA 2(‘PI 2vts41, “FB4H “FthJL -“Ft%JH 2(: BP 2(:uIo %(:LSA 2H FLL 2HFLH dHLr\A PI(::! 2LHP
Taka-Amylase a (E.C. 3.2.1.1), nres = 478 Triose phosphate isomerase (E.t1. 5.3.1.1), chain A. nres = 247 D-Xylose isomerase (E.C. 5.3.1.5). xylitoi complex. nres = 393 Enolase (E.(!. 4.2.1.11). nres = 436 Mandelate racemase (E.C. 5. t .2.2). nres = 359 Phospho-2.dehydro-3.deoxygluconate aldolase (E.C. 4.1.2.14). nrrs = 225 Cyclodextrin glycosyltransferase (E.(‘. 2.4.1.19), nrcs = 684 Pyruvate kinase. (E.C. 2.7.1.40), nres = 530 Fructose-bisphosphate aldolase (E.C‘. 4.1.2.13) A, nres = 363 Flavocytochrome II2 (E.C. 1.1.2.3). nres = 511 Muconate lartonizing enzyme, nres = 3il Rubisco, m-es = 466 Glycolate oxidasr (E.(‘. 1.1.3.1). nres = 369 Intlole-3.glycerol-phosphate synthase (E.t’. 4.1 .I .48). nres = 255 N-(5’-phosphoribosyI)anthramlate isomerase (F:.(‘. 5:::). nres = 198 IGPS/PRAI. nres = 452 Tryptophan sy-nthasr (E.t’. 4.2.1.20) chain A. nres = 268 Tryptophan synthase (E.C 4.2.1.20) chain R. nrcs = 39i L-Arabinose-binding protein. nres = 306 Bean pod mottle virus (middle component). chain 2. nres = 374 t’hymosin b. nres = 323 (‘itrate synthase (E.(‘. 4.1.3.7). nres = 437 UNA polymerase I (Klenon fragment) (E.t’. 2.7.7.7). nres = 605 R19.9 (igG2bk. (‘RI-a) fab fragment. chain I, Mouse. nres = 215 R19.9 (igG2bk. (‘RI-a) fab fragment. chain H Mouse. nres = 220 Fc fragment (iggl class). chain A Human. nres = 206 Immunoglobulin fc. nres = 206 Hannuka factor (model) Human (homo sapiens). nrex = 234 Haemagglutinin (bromelain digested). chain A. nres = 328 Human neutrophil elastase (HNE) (E.(‘. 3.4.21.37). nres = 21X Fc fragment (igE(prime)CL) (model). chain A. nrcs = 32% Immunoglobulin heterologous light cham dimer, nres = 216 lmmunoglobulin het,erologous light chain dimrr. nrrs = 216 Modified beta trypsin. nres = 223 Photosynthetic reaction centre. chain f‘. nres = 332 Photosynthetic reaction centrr. chain I,. nres = 273 Photosynthetic reaction centrr. chain M. nrrs = 323 Photosynthetic reaction centre. chain H. nrcs = 158 r)~.~lanyl-D-alanille carboxypeptidase. 
nres = 34X Rhinovirus serotype 1 (HKVl) coat protein. chain 1. nres = 283 Rhinovirus serotype 1 (HRVI ) coat protein, chain 2. nres = 253 Rhinovirus serotype I (HR\:l) coat protein, chain 3. nres = 238 I%YJri endonuclease (E.(‘. 3.1.21.4). nrrs = 261 Rhodanese (E.(‘. 2.X.1.1). nres = 293 Subtilisin carlsbrrg (subtilopeptidasr A). m-es = 274 Subtilisin RPN (E.t’. 3.4.21.14). nres = 275 Trypsin (XT) (E.t‘. 3.4.21.4), nres = 223 Thrrmitase (E.t‘. 3.4.21.14). nres = 279 Triacglglycerol acylhydrolase (E.V. 3.1.1.3). nres = 265 Trvpsinogen Bovine (bos t~aurus) pancreas. nres = 222 Thaumatin i, nres = 267 Anhydro-trypsin (E.(‘. 3.4.21.4). nres = 223 Aspartate aminotransferase (E.t’. 2.8.1. I ). nres = 396 Actinidin (sulthydryl proteinase). nreh = 21X Acid proteinase (E.t’. 3.4.23.7).penicillo~~epsirl, nres = 323 Acid proteinasr (rhizopuspepsin) (E.(‘. 3.4.23.6). nres = 325 Beta-lactamase (penicillinase) (E.t’. 3.5.Z.H). chain A. nres = 260 (‘arbonic anhvdrasr form b. nres = 256 t’hloramphenicol acetgltransferase (E.t’ 2.3.1.2~). nrrs = 213 (‘oncanavalin a. nrcs = 237 t’ytotoxic tlymphocytSe proteinase i. nreh = 227 Immunoglobulin fab. chain L Human. nres = 216 Immunoglobulin fab. chain H Human. m-es = 229 IgA fab fragment (j539) (gala&an-binding). chain L, nres = 213 1gA fab fragment (j539) (galactan-binding). chain H. nres = 220 D-(hhctOSe D-(:lWOSe binding protein ((X+BP). nres = 309 Ape-D-t:lyceraldehyde-3phosphate dehydrogenase. nres = 334 Glutamine synthetase (E.C. 6.3.1.2). chain A. nres = 468 IgGl fab fragment (hyHEL-5) and lysozyme. chain L. nres = 212 IgGI fab fragment (hyHEL-5) and lysozyme. chain H, m-es = 213 Human class i historompatibility antigen aw 68.1, nres = 270 lmmunoglobulin gl Human (homo sapiens) myrloma. nres = 216 Leucine-binding protein (LISP). nres = 346
Table 2 (continurd) ‘LI)X YMC‘PI 1 3MC‘PH 2MEVl 3MEV2 IMEV3 2PFKA SPHH ZPL\‘ 1 %PLV% PPLV3 2PRK %SNlE 2TUV.S 2TMAA 3(‘4% 3( ‘LA 3C’PP YES’1 3FAUL 3FARH 3GAPA 3(:APB X:RS YHFML 3HFMH YHLAA 3MCY:l 3PC:K 3mnl
3RP2A 3TS1 I 44 PE 4FABL 4FARH 4MI)H.A 4PTI’ 4RH\‘l 4RHVZ 4RHV3 4SIS\‘(’ .5.4(‘S 5AI)H 5( ‘Pr\ 51’EP
7TLN XADH SATI A XC‘A\T A 8Ll)H O.APIA SPAI’
dehydrogenase (ICC‘. 1.1.1.27). nres = 331 Immunoglohulin mcP(‘603 fabphosphocholinr complex, nres = 220 Immunoglohulin mcP(‘603 fall-phosphocholile complex. nres = 111 Mmgo encephalomyocarditis virus coat protein. vhain 1, nres = %6X Mengo encephalomyocarditis virus coat protein. vhain 2. nrex = 249 Mengo enrephalomyoc,arditis virus coat, protein. vhain 3. nrcs = 231 Phosphofructokinase (E.(‘. 2.7.1.11). chain A, nres z 301 p-Hydroxyhenzoate hgdroxylasr (PHISH). nres = 391 Poliovirus (type 1. Mahoney strain) chain 1. nres = 302 Poliovirus (type I. Mahoney strain) chain 4. nres = 272 Poliovirus (type 1, Mahoney strain) chain 2. nres = 238 Protrinase k (E.(‘. 3.4.21.14). nres = 179 Suhtilisin nova (E.(‘. 3.4.21.14). nrcs = 27.5 Tomato bushy stunt, virus. chain A. nres = 387 Tropomyosin, chain A, nrcs = 284 (:arhonic anhydrasr II (carbonate dehydratase), nres = 2.56 Type III rhloramphenicol acrtyltransferasr. nrcs = 213 (‘ytochrome p45Ocam. nrcs = 405 Native rlastase (E.C. 3.4.21.1 I), nrcs = 240 Lambda immunoglohulin fah(prime). chain L Human. nres = 208 Lambda immunoglohulin fab(prime). chain H Human, nres = 220 (Tataholite gene activator protein. nres = 208 (lataholite gene activator protein, nres = 205 Glutathione reduvtase (E.(‘. 1.H.4.2). nrrs = 47X IgGl fah fragment (hyHEL-IO) and lysozyme, nres = 114 IgGl fah fragment (hyHEL-10) and lysozyme, nres = 215 Human class i histocompatibility antigen a2.1, nres = 270 Immunoglohulin lambda light chain dimer (MN). nres = 416 Phosphoglyceratr kinasr (E.(‘. %.7.%.3), nres = 41.5 I’hosphoglycerate mutase (IS.{‘. d.7.5.3), nres = 230 Rat mast cell proteasr II (RMC‘PII), chain A. nres = 224 Tyrosyl-transfer RNA synthetase (E.C. 6.1.1.1). nres = 211 Avid proteinasr (E.(‘. 3.4.13.10). endothiapepsin, nres = 330 4-4-20 (igGPakappa) fat) fragment-fluorescein, nres = 219 4-S-20 (igG2akappa) fah fragment-fluorescein, nrrs = 216 (‘yt,oplasmic malate dehydrogrnase (E.C. 1.1.1.37). 
nres = 333 Beta trvpsin, diisopropylphosphoryl inhibited. nres = 223 Rhino&us 14 (HRV14). chain 1 Human, nres = 273 Rhinovirus 14 (HRVl4). chain 2 Human, nres = 255 Rhinovirus 14 (HR\‘l-I), chain 3 Human , nres = 236 Southern bran mosaic virus (vat prot,ein, chain (‘, m-es = 22% Aconitase (E.(‘. 421.3). nrcs = 378 ;Ipo-liver alcohol dehydrogenase (E.(‘. 1.I .I. 1). nrcs = 374 (‘arhoxypeptidasr aalpha (VOX) (E.(‘. 3.4.17.1). nres = 307 Pepsin (I!,.(‘. 3.4.23.1), nres = 316 Thermolysin (ICC’. 3.4.24.4). nres = 316 Ape-liver alcohol dehydrogrnase (E.C. 1.1.99.X). nres = 374 Aspartate c~arbamoyltransferasr, nres = 310 (‘atalasr (E.(‘. I. 11.1 .S) chain .I. nrrs = 506 M4 apo-Lactate dehydrogenasr (I+J.(‘. 1.1. I .27), nres = 329 Modified alpha1 -Antitrypsin, nres = 339 Papain (E.(‘. 3.4.22.2) cys-25 oxidized, nres = PI1 Ape-lactote
(β/α)8-barrel proteins are at the top of the list and the codes are shown in bold type. nres, number of residues.
score from the best-path through each distance-level matrix is subject to a low score cutoff, below which the path is ignored. Taylor & Orengo (1989a,b) have suggested suitable values of A and B as 50 and 2, respectively. The low score cutoff is given empirically by the formula (200 × N)^{1/2}, where N is the length of the shorter sequence.

The multiple alignment stage of the program is based upon the algorithm of Barton & Sternberg (1987) developed for the rapid multiple alignment of sequences. Each pairwise structural alignment is rank ordered according to the root-mean-square (r.m.s.) deviation of a least-squares fit of the 2 structures (McLachlan, 1979), using the alignment to assign the equivalent Cα atoms. The alignment with the lowest r.m.s. deviation is taken as the base alignment and the other structures are added
sequentially to this alignment in an order determined by the r.m.s. deviations. The resulting multiple alignment may be refined further by taking each structure in turn and adding it to the alignment of the other structures. The score at a particular position in the multiple alignment is obtained by averaging the scores from the relevant residue-level matrices, calculated during the pairwise alignment stage. This procedure is analogous to the use of a mutation data matrix in multiple sequence alignments, except that each pair of structures has its own matrix associated with it. The use of the previously calculated residue-level matrices removes the necessity to recalculate and average the distance-level matrices and means that the multiple alignment stage of the program takes seconds, as opposed to minutes (or hours) for each pairwise structural comparison.
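The ordering step can be sketched as follows. The text states only that the pair with the lowest r.m.s. deviation forms the base alignment and that the remaining structures are added "in an order determined by the r.m.s. deviations"; the specific rule used here, adding next the structure with the lowest r.m.s. deviation to any structure already aligned, is an assumption made for illustration, as are the function name and the input format. The r.m.s. values in the example are invented.

```python
def addition_order(rmsd):
    """Order in which structures join the multiple alignment.

    rmsd maps frozenset({name_a, name_b}) to the r.m.s. deviation of that
    pairwise fit and must contain every pair (hypothetical input format).
    """
    base = min(rmsd, key=rmsd.get)      # pair with the lowest r.m.s. deviation
    order = sorted(base)                # base alignment
    remaining = set().union(*rmsd) - set(order)
    while remaining:
        # Assumed rule: closest remaining structure to anything already aligned.
        nxt = min(
            sorted(remaining),
            key=lambda s: min(rmsd[frozenset((s, t))] for t in order),
        )
        order.append(nxt)
        remaining.remove(nxt)
    return order


# Invented r.m.s. deviations for four hypothetical structures A-D.
example = {
    frozenset("AB"): 1.0, frozenset("AC"): 3.0, frozenset("BC"): 2.0,
    frozenset("AD"): 4.0, frozenset("BD"): 5.0, frozenset("CD"): 1.5,
}
print(addition_order(example))
```

Other readings of the ordering rule are possible, for example ranking each remaining structure by its r.m.s. deviation to the base pair only; the refinement cycles described in the text would reduce the sensitivity of the final alignment to this choice.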
A Fortran computer program was written to perform the multiple alignments as described above, incorporating the basic method of Taylor & Orengo (1989a,b) for the pairwise comparisons. The method of windowing as described by Kruskal & Sankoff (1983) was used. Structural data are read from files in standard Brookhaven format. The Needleman & Wunsch (1970) algorithm was used to trace the best-path through the matrices. A pairwise comparison of 2 proteins of length 250 residues with a window parameter of 50 takes approximately 70 min of central processor unit (c.p.u.) time on a Sun Sparc IPC workstation. The full multiple alignment of 5 (β/α)8-barrel proteins requires 16 h and 45 min of c.p.u. time. This was adequate for our needs and no attempt was made to speed up the process.

(c) Residue-residue contacts

Side-chain contacts were calculated for all residues in the β-strands and α-helices comprising the (β/α)8-barrel. van der Waals' radii were taken from the OPLS forcefield of Jorgensen & Tirado-Rives (1988). These values were multiplied by a factor of 0.8 to include in the contact list only those side-chain atoms with appreciable van der Waals' overlap.

(d) Sequence database searching
A database of sequences was derived to test for possible sequence motifs for the (β/α)8-barrel fold. Sequences were taken from the PIR database of proteins of known structure (NRL-3D, protein sequence-structure database version 2.0, September 1990, version V4.5/5.3 of PIR). All proteins of sequence length less than 200 residues were removed as it seemed unlikely that a sequence this short would form a (β/α)8-barrel. The sequences of several (β/α)8-barrel proteins for which there were no data in the Brookhaven database or which were not in the NRL-3D database were also included (mandelate racemase (MANR), ADPSGP, cyclodextrin glycosyltransferase (CDGT), PYK, fructose bisphosphate aldolase (ADRBA), IGPS and PRAI). The final database consisted of 124 sequences, including 17 (β/α)8-barrel proteins. A full list is given in Table 2. The codes for the (β/α)8-barrel proteins are given in bold type. IGPS and PRAI are 2 separate domains of a bienzyme and the combined sequence (GWEC) was also included in the database. Two methods have been used to scan the database, that of Barton & Sternberg (1990) and that of Bowie et al. (1991). These are described in section 4, below.
3. Multiple Alignment of (β/α)8-Barrels

(a) Multiple alignment

A multiple alignment of five (β/α)8-barrel proteins, TIM, GOX, RUB, XIA and TAA, was generated, based on structural criteria described in Methods, above. The lengths of the structures studied vary between 247 residues for TIM and 478 residues for TAA. TAA has two domains with the (β/α)8-barrel in domain A (residues 1 to 380). Domain B is an antiparallel β-sandwich structure and was removed from the structure prior to alignment. Often, there are insertions of other small domains or regions of secondary structure in the barrel domain of these proteins. For example, in TAA there is an insertion of three β-strands between β-strand 3 (B3) and α-helix 3 (H3) of the barrel domain. Such insertions were removed from the co-ordinate files prior to the full multiple alignment, whilst retaining single portions of secondary structure such as the helices in the loops between B4-H4, B5-H5, H5-B6 and B8-H8 of TIM. The regions of the structures used in the alignments presented below are listed in Table 1. A preliminary pairwise alignment of TIM and TAA showed that the alignment method correctly inserted a large gap in the TIM sequence about the three additional β-strands of the TAA structure.

The multiple alignment of the five (β/α)8-barrel proteins is shown in Figure 1. Residues in the β-strands and α-helices comprising the (β/α)8-barrel are shown in upper case and the β-strands and α-helices are numbered sequentially from 1 to 8. Residues which form other β-strands or α-helices are shown in italic upper case. The alignment was generated as described in Methods with a window parameter of 50 and values of A = 50, B = 2 in equation (1). A penalty of 5 was assigned for starting a gap. The initial multiple alignment was refined by three further cycles, no changes occurring after the second cycle.

It is evident from Figure 1 that the multiple alignment procedure has correctly aligned the eight β-strands and eight α-helices of the barrel in each structure, inserting gaps where necessary about other sections of secondary structure. The global superimposition of GOX, RUB, XIA and TAA onto TIM, using the alignment to identify the equivalent residues, is shown in Figure 2. The β-strands were aligned correctly with respect to the pleating of the β-sheet and the α-helices also superimpose well. In general, there are no gaps in the core regions of the barrel, i.e. the α-helices and β-strands comprising the barrel. A notable exception is the gap about Pro86 in β-strand 3 of XIA. Henrick et al. (1989) have noted that this proline residue produces a kink in the β-strand in XIA and hence the insertion of a gap here in the other sequences is reasonable (see below). A further indicator of the reliability of the alignment is that residues involved in strand-helix contacts, i.e. those residues which may help to stabilize the fold, tend to be aligned in both the α-helices and the β-strands, as discussed in the following sections.
(b) Alignment of the β-strands

The alignment of the β-strands is shown in more detail in Figure 3. Only β-strand residues are shown. Residues involved in strand-helix contacts, calculated as described in Methods, are shown in upper case and those pointing into the β-barrel core and making contacts with non-nearest neighbour β-strands are boxed. The residues in GOX, RUB and TIM identified by Lesk et al. (1989) as contributing to the core of the β-barrel are shown boxed in upper case. We have analysed TAA and XIA in a similar fashion and the residues contributing side-chains to the core of the β-barrel in these two
[Figure 1: multiple structural alignment of TIM, GOX, RUB, XIA and TAA, with secondary-structure annotation rows (a, b) and the barrel elements β1 to β8 and α1 to α8 labelled. The alignment block is not recoverable from the extracted text.]
Figure 1. Multiple alignment of5 (,9/a)s-barrelproteins. a-Helices and p-strands in TIM and TAA are indiwtrd by the symbols a and b, respertively. The residues comprising the cc-helices and p-strands of the barrel are shown in upper casv type. Other regions of secondary struct,urr are shown in upper case italics. The numbering above the sequences refks to TTM and that below to TAA.
Figure 2. Global superposition of GOX (green), RUB (white), XIA (yellow) and TAA (magenta) on TIM (red), using the multiple alignment (Fig. 1) to define equivalent Cα atoms. Only the α-helices and β-strands of the (β/α)8-barrel are shown. Figures were drawn using the program PREPI (written by S. A. Islam).
structures are indicated similarly. Both TAA and XIA may be added to the same class as TIM and RUB, where odd-numbered β-strands contribute one side-chain to the core and even-numbered β-strands contribute two side-chains to the core. It should be noted, however, that for XIA the assignment was not straightforward and it was necessary to include Pro288 as contributing to the packing. This residue is one position before the start of β-strand 8 at Arg289 as defined by Henrick et al. (1989). Residues which contribute side-chains to the three layers forming the core of the β-barrel are aligned, except in β-strand 7 where the XIA and TAA sequences appear to be displaced by two residues. In this instance, the alignment of XIA and TAA corresponds to the alignment of residues packing with α-helices. Each β-strand has a core of at least three, and in most cases four, consecutive residues; within this core are residues packing against the α-helices, with neighbouring residues contributing side-chains to the β-barrel interior. Wilmanns et al. (1991) have found a similar region of four central residues in two other (β/α)8-barrel enzymes, the two (β/α)8-barrel domains of the bienzyme (phosphoribosyl)-anthranilate isomerase-indole glycerolphosphate synthase and the α-subunit of tryptophan synthase, though they did not explicitly identify the contacts of the residues with side-chains pointing away from the β-barrel.
The insertion of a gap about Pro86 in XIA, β-strand 3, may be explained in more detail. Both Pro86 and Met87 in XIA contribute side-chains to the core of the β-barrel. Examination of the structure revealed that the proline fills the gap left by Gly49, β-strand 2, which should contribute a side-chain to the core. Val85 and Val88 pack against an α-helix and are aligned with residues of similar function in the other four structures. It would appear, therefore, that Pro86 fulfils an important structural role. In this regard, it is interesting that in five xylose isomerase sequences from different species (Henrick et al., 1989), the three with a proline at the position of Pro86 also have a glycine at an equivalent position to Gly49. The other two sequences have a Phe in place of Gly49 and no proline.
(c) Alignment of α-helices and strand-helix packing
The alignment of the α-helices is shown in Figure 4, in which the contacts of the residues are detailed. Only residues in the α-helices are shown. Side-chain
Figure 3. Alignment of the β-strands showing the residues involved in packing the core of the β-barrel (boxed) and those making contacts with α-helices (upper case). Boxed residues in upper case indicate residues involved in packing the β-barrel designated following Lesk et al. (1989) (see the text).
Figure 4. Alignment of the α-helices of the (β/α)8-barrel. Residues in contact with neighbouring α-helices are shown in upper case. Residues in contact with the β-barrel are shown in upper case and underlined. Numbering refers to the TIM sequence.
contacts are depicted as follows: upper case underlined for residues making contacts with β-strand residues, upper case for contacts with neighbouring α-helices (excluding residues making β-strand contacts) and lower case for residues making only intra-helix contacts. The contacts of aligned residues tend to be conserved at certain positions. For example, in α-helix 4 there is a region of seven positions, from TIM residues Ile109 to His115, in which the contacts are conserved. Only in α-helix 2 is it difficult to identify such a region. This may be due to the particularly short length of this helix in GOX, just five residues. The patterns of the contact residues indicate that the α-helix/β-strand packing in the (β/α)8-barrel enzymes is analogous to that observed in other α/β proteins (Janin & Chothia, 1980; Cohen et al., 1982) where residues i+1, i+4, i+5 and i+8 of the α-helix form a diamond around a central β-strand residue.
Table 3 Volumes of side-chains in the β-sheet/α-helix interface

        α                     β                     Total
Code    Volume (Å³)  (n)      Volume (Å³)  (n)      Volume (Å³)  (n)
TIM     3185         (37)     2077         (23)     5262         (60)
GOX     2429         (35)     1708         (17)     4137         (52)
RUB     2866         (30)     1151         (16)     4017         (46)
XIA     2694         (32)     1900         (19)     4594         (51)
TAA     3463         (34)     1640         (19)     5103         (53)

The volumes of side-chains in the β-sheet/α-helix interface were calculated using the values of Lesk & Chothia (1980) from the residues indicated in Figs 3 and 4. The number of residues is shown in parentheses.
Figure 5. The difference in volume (largest - smallest) at the 34 positions in the multiple alignment where the β-sheet/α-helix interactions are conserved in at least 4 of the 5 structures. Horizontal axis: volume difference (Å³).
Table 4 Volumes of side-chains packing into the core of the β-barrel

        Lesk                  Total
Code    Volume (Å³)  (n)      Volume (Å³)  (n)      Average
TIM     787          (12)     1359         (18)     75.5
GOX     1088         (12)     1088         (14)     77.7
RUB     867          (12)     1249         (16)     78.1
XIA     709          (12)     1442         (21)     68.7
TAA     952          (12)     1571         (18)     87.3

The side-chain volumes of residues indicated in Fig. 3 for the classification of Lesk et al. (1989) and the total volume of side-chains packing into the β-barrel are given.

(d) Residue volumes Table 3 details the total volumes of residue side-chains in the sheet-helix interface for each protein, calculated from the contact residues indicated in Figures 3 and 4. Volumes quoted by Lesk & Chothia (1980) have been used. Figure 5 is a histogram of the volume difference (largest - smallest residue) at the 34 positions in the alignment where the sheet-helix interactions are conserved in at least four of the five structures. The peaks in Figure 5 around 75 Å³ and 115 Å³ relate, for example, to residue variations of Ala to Ile (77 Å³) and Ala to Phe (111 Å³). The volumes of the side-chains of residues contributing to the core of the (β/α)8-barrel are given in Table 4. The total volume of all side-chains pointing into the β-barrel interior varies between 1088 Å³ in GOX and 1571 Å³ in TAA, though, interestingly, the average volume of the residues in TIM, GOX and RUB is very similar. The total volumes of side-chains contributing to the core as defined by Lesk et al. (1989) vary between 709 Å³ in XIA and 1088 Å³ in GOX. These results may be compared to those obtained for the globin family, where individual sequence changes do not maintain residue size, giving a mean variation of 56 Å³ among homologous amino acids in the core, but the volume of the hydrophobic core remains almost constant at 3180 Å³ with a root-mean-square deviation of just 15 Å³ (Lesk & Chothia, 1980; Lim & Ptitsyn, 1970). In contrast, there are no constraints imposed by the (β/α)8-barrel topology on the volumes of the residues at either a global level (the total volume of the sheet/helix interface and the β-barrel core) or at a residue level.
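The averages quoted for the β-barrel core follow directly from the totals and residue counts of Table 4. A minimal Python sketch of that arithmetic is given here; the numbers are transcribed from Table 4, while the dictionary layout and function name are illustrative assumptions and not part of the original analysis.

```python
# Total side-chain volume (cubic angstroms) packing into the beta-barrel core
# and the number of contributing residues, transcribed from Table 4.
barrel_core = {
    "TIM": (1359, 18),
    "GOX": (1088, 14),
    "RUB": (1249, 16),
    "XIA": (1442, 21),
    "TAA": (1571, 18),
}


def average_volume(total, n_residues):
    """Mean side-chain volume per core residue, to one decimal place."""
    return round(total / n_residues, 1)


averages = {code: average_volume(v, n) for code, (v, n) in barrel_core.items()}

# Range of the total core volume across the five enzymes.
totals = [v for v, _ in barrel_core.values()]
spread = max(totals) - min(totals)

for code, avg in averages.items():
    print(code, avg)
print("spread of total core volume:", spread)
```

The per-residue averages for TIM, GOX and RUB come out close together, whereas the total core volume differs by several hundred cubic angstroms between GOX and TAA, which is the contrast drawn in the text.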
4. Definition of Sequence Motifs
The multiple alignment has permitted the identification of core regions within the α-helices and β-strands of the (β/α)8-barrel for which the side-chain contacts are conserved. This prompts the question of whether it is possible to use these regions to define a sequence motif that could identify other enzymes adopting this fold. However, it is clear from the discussion above that these positions can accommodate a wide range of amino acid substitutions and hence it is necessary to utilize the information contained within the multiple alignment in a general manner. We have used two approaches to address this problem, generating the six motifs listed in Table 5 and described in more detail below.
The program AMPS (Barton & Sternberg, 1990; Barton, 1990) scans flexible patterns derived from a multiple alignment against a sequence database. A pattern consists of a series of elements, each representing the residues at a particular position in the multiple alignment, separated from neighbouring elements by a defined gap length. The gap length may be fixed (0, 1, 2 etc.) or variable (e.g. 5 to 10 residues inclusive). Three patterns have been defined: (1) AMPS-1: using as a pattern the residues of TIM in the (β/α)8-barrel secondary structure, defined as given in the PDB file and indicated in Figure 1. (2) AMPS-5: using the sequences of the five
Table 5 Definition of sequence motifs

Motif      Proteins                   Regions used                            Length  Scoring scheme          Method
AMPS-1     TIM                        PDB secondary structure                 159     250PAM Dayhoff matrix   AMPS
AMPS-5     TIM, GOX, RUB, XIA, TAA    Barrel β-strand and α-helical regions   139     250PAM Dayhoff matrix   AMPS
AMPS-3     TIM, GOX, RUB              Barrel β-strand and α-helical regions   140     250PAM Dayhoff matrix   AMPS
BOWIE-1    TIM                        Whole protein                           247     Environments            3D profile
BOWIE-5    TIM, GOX, RUB, XIA, TAA    Whole alignment                         454     Environments            3D profile
HYBRID-5   TIM, GOX, RUB, XIA, TAA    Barrel β-strand and α-helical regions   139     Environments            AMPS

The 6 sequence motifs derived for scanning against the databases are described according to the proteins used, the regions used, the total length of the motif, the scoring scheme utilized when scanning the database and the method.
Figure 6. Alignment of motif AMPS-5 with the (β/α)8-barrel protein pyruvate kinase (PYK). The secondary structure of PYK is shown below the sequence: a, α-helical residues; b, β-strand residues. The β-strands and α-helices comprising the (β/α)8-barrel domain are indicated.
aligned structures, taking as elements the positions occupied by residues from all five sequences in the β-strand and α-helical regions, Phe6 to Lys13, Lys19 to Ala31, Val40 to Ile46 etc. (numbers correspond to the TIM sequence, Figure 1). (3) AMPS-3: based on AMPS-5 but using only the TIM, GOX and RUB sequences. Wilmanns et al. (1991) have suggested that these three proteins are related by divergent evolution.
The same gap parameters were employed in all three patterns. Gaps between adjacent elements in the same α-helix or β-strand are set to zero and flexible gaps are defined between secondary structure units, 1 to 25 in the α/β loops and 1 to 90 in the β/α loops. If there is a gap of one position in one or more of the aligned structures, for example about the proline in XIA, β-strand 2, a flexible gap of zero to one is defined. A 250PAM Dayhoff matrix was used to score a pattern against the sequences in the database. The pattern AMPS-5 is shown in Figures 6 and 7 aligned to pyruvate kinase and glutathione reductase.
Figure 7. Alignment of motif AMPS-5 with the α/β protein glutathione reductase (3GRS). The secondary structure of 3GRS is shown below the sequence: a, α-helical residues; b, β-strand residues.
As an alternative procedure, the recently described method of Bowie et al. (1991) has been implemented. The method can be used with a single probe structure or the probe can be derived from a multiple alignment of several structures. The residues of a test protein of known structure are allocated to one of 18 environment classes according to the accessibility, polar fraction and local secondary structure of the residue. Bowie et al. (1991) have derived a scoring matrix of environment class against amino acid type that allows a position-dependent scoring table or three-dimensional structure profile to be generated. The scoring table gives the likelihood of finding any of the 20 amino acids at a particular position and can be used by a dynamic programming sequence alignment algorithm. Two further motifs were defined for use with this method. (4) BOWIE-1: using a 3D profile generated from the TIM structure. (5) BOWIE-5: using the multiple alignment of TIM, GOX, RUB, XIA and TAA, a profile was generated from the environment class of each residue in the five (β/α)8-barrel enzymes (Gribskov et al., 1987). Environment classes were generated using the program ENVIRONMENTS, written by J. U. Bowie, and secondary structure definitions were taken from the PDB file. For the multiple alignment, secondary structure gap penalties were assigned according to the consensus secondary structure, three out of five residues in an α-helix (or β-strand). The 3D profiles were scanned against the database using a sequence homology search with secondary structure gap penalties.
The program AMPS uses a Dayhoff matrix to generate a position-dependent scoring table for each element. However, it is also possible to input a user-defined scoring table. Thus, in an attempt to improve the discrimination of the (β/α)8-barrels, the two methods described above were amalgamated to give the final motif. (6) HYBRID-5: pattern AMPS-5 but utilizing the relevant regions of the 3D profile scoring table derived for BOWIE-5 in an AMPS scan.
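The idea of a flexible pattern, contiguous blocks of position-specific scores separated by bounded gaps, can be illustrated with a short Python sketch. This is not the AMPS implementation: the function names, the toy scoring columns (used in place of a 250PAM Dayhoff matrix or a 3D profile table), the mismatch score of -1 and the example sequence are all assumptions made for illustration.

```python
from functools import lru_cache


def block_score(block, seq, start):
    """Score a contiguous block (zero internal gaps) placed at seq[start:]."""
    # Residues absent from a column get an assumed mismatch score of -1.0.
    return sum(col.get(seq[start + k], -1.0) for k, col in enumerate(block))


def best_match(blocks, gaps, seq):
    """Best-scoring placement of the blocks, in order, on seq.

    blocks: list of blocks; each block is a list of {residue: score} columns.
    gaps:   gaps[i] = (lo, hi), residues allowed between block i and block i+1.
    Returns (score, start positions), or None if the pattern cannot fit.
    """

    @lru_cache(maxsize=None)
    def place(i, start):
        end = start + len(blocks[i])
        if end > len(seq):
            return None
        here = block_score(blocks[i], seq, start)
        if i == len(blocks) - 1:
            return (here, (start,))
        lo, hi = gaps[i]
        best = None
        for g in range(lo, hi + 1):  # try every allowed gap length
            rest = place(i + 1, end + g)
            if rest is not None and (best is None or rest[0] > best[0]):
                best = rest
        if best is None:
            return None
        return (here + best[0], (start,) + best[1])

    candidates = [place(0, s) for s in range(len(seq))]
    candidates = [c for c in candidates if c is not None]
    return max(candidates, key=lambda c: c[0]) if candidates else None


# Toy scoring columns (hypothetical): a three-residue hydrophobic "strand"
# followed, after a loop of 1 to 4 residues, by a three-residue "helix".
hydrophobic = {r: 2.0 for r in "VILFMA"}
polar = {r: 1.0 for r in "DEKRQN"}
strand = [hydrophobic] * 3
helix = [polar, hydrophobic, polar]

result = best_match([strand, helix], [(1, 4)], "GGVILGGKLEGG")
print(result)
```

Widening a gap range such as (1, 4) to (1, 90) increases the number of placements that `place` must consider for each block, and every extra placement is another opportunity for a block to land on an unrelated helix or strand.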
5. Discrimination of Motifs
Results for the different search methods are given in Table 6. The 20 top-scoring proteins are listed with (β/α)8-barrel enzymes shown in bold type. The positions of the other (β/α)8-barrel enzymes are shown also. The AMPS package and the 3D profile method use different scoring schemes. The protein sequences defining a particular motif are marked by an asterisk.
Table 6 Results from searching a database of 124 sequences with different sequence motifs

Top twenty sequences (score)
Rank  AMPS-1           AMPS-5           AMPS-3           BOWIE-1          BOWIE-5          HYBRID-5
1     TIM* (787)       GOX* (98.2)      RUB* (208.7)     TIM* (13440)     1CTS (2470)      1FCB (211.8)
2     CDGT (208)       XIA* (96.0)      TIM* (202.3)     1CTS (5160)      1DPI (2175)      GOX* (206.8)
3     3GRS (207)       TAA* (?)         GOX* (185.7)     3PGK (5100)      TAA* (2150)      3PGK (185.2)
4     1FCB (203)       TIM* (88.6)      1FCB (92.7)      PYK (50??)       PYK (2120)       1DPI (171.8)
5     TAA (198)        RUB* (81.0)      CDGT (78.3)      2PHH (500?)      2AAT (2120)      1CTS (157.9)
6     MANR (196)       1FCB (5?.4)      PYK (77.3)       2AAT (4880)      1FCB (2085)      PYK (149.5)
7     2GLSA (196)      PYK (50.4)       3PGK (77.3)      2PFKA (4755)     3CPP (1980)      1PRCM (148.2)
8     1DPI (19?)       3PGK (50.2)      GWEC (76.7)      ADRBA (4605)     1WSYB (1950)     3CPP (144.1)
9     1BMV2 (186)      1DPI (48.?)      2TBVA (76.0)     RUB (4490)       3PGK (1930)      CDGT (140.7)
10    XIA (186)        2ENL (44.2)      1DPI (75.7)      3CPP (4445)      2ENL (1925)      8AT1A (137.2)
11    PYK (186)        GWEC (44.0)      3GRS (74.7)      XIA (4425)       GOX* (1890)      TAA* (135.4)
12    8CATA (186)      3GRS (43.6)      XIA (73.0)       3GRS (4365)      CDGT (1890)      GWEC (135.4)
13    GWEC (179)       CDGT (42.4)      8ADH (71.7)      GOX (4320)       RUB* (1880)      2AAT (134.2)
14    RUB (174)        2GLSA (42.0)     1ABP (69.3)      1FCB (4240)      3GRS (1860)      2GLSA (133.7)
15    2PHH (170)       2GD1O (41.6)     MANR (68.7)      1HMGA (4220)     8CATA (1820)     RUB* (133.5)
16    8ADH (170)       2TBVA (41.0)     2ENL (67.3)      1BMV2 (4185)     2GLSA (1820)     3GRS (129.6)
17    9APIA (169)      MANR (40.4)      2PHH (67.3)      8AT1A (4170)     GWEC (1795)      8CATA (129.4)
18    2AAT (166)       1ABP (?)         1BMV2 (66.7)     1PRCM (4150)     5ACN (1770)      2TBVA (122.3)
19    2ENL (166)       1CTS (38.0)      3CPP (66.0)      4MDHA (4140)     XIA* (1650)      1BMV2 (121.2)
20    1HMGA (165)      ADRBA (37.0)     1WSYB (66.0)     MANR (4085)      1MLE (1615)      2LDX (118.7)

Other (β/α)8-barrels (rank, sequence, score)
      21 ADRBA (165)   27 1WSYA (34.6)  22 1MLE (65.0)   2? 2ENL (4085)   23 ADRBA (1525)  21 2ENL (116.2)
      26 1MLE (160)    27 1MLE (34.6)   26 IGPS (62.7)   29 GWEC (3885)   26 1WSYA (1455)  22 1MLE (112.5)
      34 GOX (151)     34 IGPS (3?.4)   27 TAA (6?.3)    39 1WSYA (3485)  32 TIM* (1365)   27 MANR (108.2)
      53 IGPS (134)    44 ADPSGP (27.2) 37 1WSYA (56.7)  4? IGPS (3460)   39 MANR (1230)   33 1WSYA (103.5)
      68 1WSYA (121)   54 PRAI (23.4)   38 ADRBA (56.?)  48 1MLE (3335)   42 IGPS (1115)   40 PRAI (8?.2)
      70 PRAI (117)                     68 ADPSGP (41.3) 51 TAA (3295)    84 ADPSGP (610)  43 TIM* (84.0)
      101 ADPSGP (94)                   80 PRAI (33.7)   62 CDGT (2980)   87 PRAI (580)    49 ADRBA (82.1)
                                                         75 PRAI (2710)                    51 IGPS (81.4)
                                                         90 ADPSGP (2355)                  60 XIA* (72.5)
                                                                                           76 ADPSGP (58.2)

Lowest-ranked sequence
      124 3FABL (75)   124 1MCPL (7.4)  124 2FBJL (2?.7) 124 1THI (1490)  124 1PTE (-455)  124 1PTE (2.0)

A database of 124 sequences, containing 17 (β/α)8-barrel enzymes, was scanned using the motifs described in Table 5. The top 20 proteins and the positions of other (β/α)8-barrel proteins are listed. All (β/α)8-barrel proteins are shown in bold type. The sequences defining the motifs are indicated by an asterisk.
Table 7 The number of overlapping (β/α)8-barrel secondary structure elements

        AMPS-1       AMPS-5       AMPS-3       BOWIE-1      BOWIE-5      HYBRID-5
Code     α  β Total   α  β Total   α  β Total   α  β Total   α  β Total   α  β Total
TIM      8  8   16*   8  8   16*   8  8   16*   8  8   16*   1  0    1*   6  7   13*
GOX      1  0    1    8  8   16*   8  8   16*   1  0    1    1  1    2*   7  6   13*
RUB      0  0    0    8  8   16*   8  8   16*   1  1    2    1  1    2*   1  0    1*
XIA      0  1    1    8  8   16*   0  0    0    0  1    1    5  4    9*   4  3    7*
TAA      0  1    1    8  8   16*   0  0    0    0  0    0    2  3    5*   3  4    7*
2ENL     2  1    3    0  0    0    1  0    1    1  1    2    1  0    1    0  0    0
MANR     0  0    0    0  1    1    0  1    1    0  0    0    0  0    0    0  1    1
ADPSGP   1  0    1    2  2    4    2  1    3    1  1    2    0  0    0    0  1    1
PYK      0  0    0    2  2    4    0  0    0    0  0    0    2  1    3    0  0    0
ADRBA    1  2    3    1  0    1    0  0    0    0  0    0    0  0    0    0  0    0
1FCB     0  0    0    4  4    8    6  6   12    1  1    2    1  0    1    6  4   10
1MLE     2  1    3    1  0    1    0  0    0    1  0    1    3  3    6    0  0    0
1WSYA    3  1    4    3  4    7    8  5   13    1  1    2    1  0    1    6  6   12
CDGT     1  1    2    0  0    0    1  0    1    0  0    0    3  3    6    3  4    7
IGPS     2  2    4    0  1    1    5  5   10    1  1    2    1  1    2    1  ?    ?
PRAI     0  0    0    0  2    2    0  0    0    0  0    0    1  0    1    0  0    0
GWEC     1  0    1    1  1    2    1  0    1    1  1    2    1  0    1    1  0    1

The alignments obtained with (β/α)8-barrel proteins, resulting from the scans listed in Table 6, were assessed by calculating the number of (β/α)8-barrel secondary structure elements overlapping by more than 50% with the equivalent portion of secondary structure of the scanned sequence. The sequences used in defining the motifs are indicated by an asterisk.
Considering firstly the results of the AMPS scans, there is some discrimination of (β/α)8-barrel enzymes from other proteins. The TIM secondary structure motif, AMPS-1, gives ten of the 17 (β/α)8-barrel sequences in the top 20 scores (including TIM itself). The discrimination is improved with the inclusion of information from the multiple alignment, AMPS-5, especially with respect to the position of the lower scoring sequences. The lowest scoring (β/α)8-barrel sequence is 54th out of 124 sequences. The five proteins in the template score highest and the next two sequences are (β/α)8-barrel proteins. Twelve of the 17 (β/α)8-barrel sequences in the database are in the top 20 sequences. Pattern AMPS-3, comprising TIM, GOX and RUB, also improves the score of the lowest scoring barrels with respect to AMPS-1, but not to such a great extent, the lowest barrel being at position 80. With all three motifs the sequences in the template score highest. The 3D profile method (Bowie et al., 1991) with TIM as the probe, BOWIE-1, gives similar results to the AMPS package, with eight (β/α)8-barrels in the top 20 scores. Using the multiple alignment to generate the probe, BOWIE-5, improves the results with ten (β/α)8-barrels in the top 20 scores. In this instance, however, the five (β/α)8-barrels in the probe, namely TIM, GOX, RUB, XIA and TAA, do not score best, as the 3D structure profile method tries to match the local environment of the residue rather than specific sequence relations (Bowie et al., 1991). Combining the two methods in HYBRID-5 degrades the performance in terms of the rank of the (β/α)8-barrel enzymes, with only seven in the top 20 scores, including three of the sequences in the motif.
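The counts quoted for AMPS-5 can be reproduced from a ranked list with a few lines of Python. The sketch below is illustrative only: the top-20 order and the ranks of the remaining barrels are transcribed from the AMPS-5 results in Table 6 as far as they are legible, the barrel set is the 17 codes listed in Table 7, and the function names are assumptions. Reading the "top 12%" quoted in the Discussion as the top 15 of 124 sequences is likewise an interpretation, not something the paper states.

```python
# The 17 barrel enzymes in the database and the five template sequences.
BARRELS = {
    "TIM", "GOX", "RUB", "XIA", "TAA", "2ENL", "MANR", "ADPSGP", "PYK",
    "ADRBA", "1FCB", "1MLE", "1WSYA", "CDGT", "IGPS", "PRAI", "GWEC",
}
TEMPLATE = {"TIM", "GOX", "RUB", "XIA", "TAA"}

# AMPS-5 top 20 in rank order, and ranks of the barrels outside the top 20
# (transcribed from Table 6; treat individual codes as provisional).
AMPS5_TOP20 = [
    "GOX", "XIA", "TAA", "TIM", "RUB", "1FCB", "PYK", "3PGK", "1DPI", "2ENL",
    "GWEC", "3GRS", "CDGT", "2GLSA", "2GD1O", "2TBVA", "MANR", "1ABP", "1CTS",
    "ADRBA",
]
AMPS5_OTHER_RANKS = {"1WSYA": 27, "1MLE": 27, "IGPS": 34, "ADPSGP": 44, "PRAI": 54}
DATABASE_SIZE = 124


def barrels_in_top(ranked, k):
    """Number of barrel sequences among the first k ranked sequences."""
    return sum(1 for code in ranked[:k] if code in BARRELS)


def accuracy_in_top(ranked, k, ignore=frozenset()):
    """Fraction of barrels among the first k sequences, skipping `ignore`."""
    considered = [code for code in ranked[:k] if code not in ignore]
    return sum(code in BARRELS for code in considered) / len(considered)


top20_barrels = barrels_in_top(AMPS5_TOP20, 20)
top15_accuracy = accuracy_in_top(AMPS5_TOP20, 15, ignore=TEMPLATE)
lowest_barrel_rank = max(AMPS5_OTHER_RANKS.values())
barrels_in_bottom_half = lowest_barrel_rank > DATABASE_SIZE // 2

print(top20_barrels, top15_accuracy, lowest_barrel_rank, barrels_in_bottom_half)
```

Excluding the template sequences matters for the accuracy figure, since a motif trivially ranks the sequences it was built from near the top.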
A key aspect of using pattern or homology searches is that the alignment with a high scoring sequence should be correct. The alignments of the patterns with the (β/α)8-barrel enzymes in the database were evaluated by calculating the number of correctly aligned portions of secondary structure. A correct alignment was taken as at least 50% overlap between equivalent secondary structure elements in the pattern and the scanned sequence, taking account of the order of the β-strands and α-helices, i.e. β-strand 1 of the barrel from the pattern should be aligned with β-strand 1 of the barrel in the scanned sequence. The qualities of the alignments obtained for each pattern, based on the criteria above, are given in Table 7. The patterns AMPS-1, AMPS-5 and AMPS-3 give increasingly better alignments. The use of pattern AMPS-3 significantly improves the alignments for 1FCB, 1WSYA and IGPS, though 1WSYA ranks lower in the overall list than with AMPS-5 (Table 6). Correct alignments are obtained only for the structures comprising the pattern. Similarly, with the use of 3D profiles better alignments are generally obtained when including more structures in the search pattern, i.e. BOWIE-5 compared to BOWIE-1. However, this improvement occurs for a different group of proteins, PYK, 1MLE and CDGT. Significantly, the inclusion of the BOWIE-5 scoring table into an AMPS scan, HYBRID-5, results in a major improvement of the alignments of 1FCB and 1WSYA. The improvements in the alignments with 1FCB may be accounted for in part by the similarity between its sequence and that of GOX (37% sequence identity; Xia & Mathews, 1990). Wilmanns et al. (1991) have suggested that TIM, GOX, RUB, 1WSYA, IGPS and PRAI are related
by divergent evolution as they share a common phosphate binding site. The improvement in the quality of the alignments of 1WSYA and IGPS, particularly when using pattern AMPS-3, tends to support this conclusion. Nevertheless, PRAI consistently scores badly. A high score with a pattern does not necessarily mean that the alignment is correct. Compare, for example, the results for pyruvate kinase (PYK) and tryptophan synthase (1WSYA) with pattern AMPS-3. Despite PYK ranking 6 and 1WSYA ranking 37, the alignment of 1WSYA has 13 out of 16 correctly matched α-helices and β-strands, whereas PYK has none. PYK has an α/β domain between β-strand 3 and α-helix 3 of the barrel and this causes a misalignment of the search pattern.
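The 50% overlap criterion used to judge alignment quality can be made concrete with a small Python sketch. The function names and residue ranges below are hypothetical, and measuring the overlap relative to the length of the element in the scanned sequence is an assumption, since the text does not say which element supplies the denominator. Matching elements by name enforces the requirement that β-strand 1 of the pattern be compared with β-strand 1 of the barrel, and so on.

```python
def overlap_fraction(true_span, aligned_span):
    """Fraction of true_span (inclusive residue range) covered by aligned_span."""
    (t0, t1), (a0, a1) = true_span, aligned_span
    shared = min(t1, a1) - max(t0, a0) + 1
    return max(shared, 0) / (t1 - t0 + 1)


def count_correct(aligned, true, threshold=0.5):
    """Count elements aligned onto the equivalent element of the true barrel.

    aligned: {element name: residue range covered by the pattern element}
    true:    {element name: residue range of that element in the structure}
    """
    return sum(
        1
        for name, span in aligned.items()
        if name in true and overlap_fraction(true[name], span) >= threshold
    )


# Hypothetical residue ranges for three elements of a scanned sequence.
true_elements = {"B1": (6, 13), "H1": (40, 52), "B2": (53, 60)}
pattern_placement = {"B1": (5, 12), "H1": (20, 30), "B2": (50, 56)}

n_correct = count_correct(pattern_placement, true_elements)
print(n_correct)
```

In this toy case B1 and B2 satisfy the criterion while H1 has been placed on an unrelated stretch of sequence, the kind of error described for PYK, where an inserted domain draws pattern elements away from the barrel.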
6. Discussion A multiple alignment of the sequences of five proteins having the (β/α)8-barrel topology was derived solely from the co-ordinates of the proteins and was not dependent on defined information such as β-strand and α-helix definitions. The alignment has been shown to reproduce key features of the (β/α)8-barrel topology with the correct alignment of all eight β-strands and α-helices. In particular, other sections of secondary structure in the β/α loops did not hinder the alignment. The alignment of the β-strands is consistent with other studies of (β/α)8-barrel proteins in terms of the packing of residues in the interior of the barrel (Lesk et al., 1989); however, in most cases the β-strands form a core of four residues with two side-chains pointing into the barrel interior and two making contacts with the surrounding α-helices. The alignment is supported further by the fact that the aligned α-helical and β-strand residues fulfil similar structural roles, those involved in strand-helix contacts being equivalenced in the five structures. The α-helix/β-sheet packing is similar to that in other α/β proteins. There is no conservation of side-chain volume at a global or a residue level. Sequence motifs were generated from the multiple alignment and two methods (Barton & Sternberg, 1990; Bowie et al., 1991) used to scan them against a database of 124 protein sequences, including 17 (β/α)8-barrel enzyme sequences. The best results in terms of discrimination of (β/α)8-barrel enzymes and the quality of the alignments were achieved when using information from the multiple alignment in the AMPS package (Barton & Sternberg, 1990). This motif, AMPS-5, was able to identify the top 12% of sequences as having a (β/α)8-barrel fold with 50% accuracy (ignoring the sequences in the motif) and the bottom 50% of sequences as not being (β/α)8-barrel proteins with 100% accuracy. Nevertheless, in most cases the quality of the alignments was disappointing.
The reasons for this may be found by examining the alignments in detail. This also sheds light on why some (β/α)8-barrel enzymes score consistently poorly, e.g. ADPSGP and PRAI, whilst some non-barrel structures tend to score well with all the motifs, e.g. 1DPI, 3GRS and 3PGK. In the discussion below attention shall be focussed on alignments obtained with motif AMPS-5 but the general conclusions apply to all the motifs. The motifs used in the AMPS scans were chosen so as to detect the repeating pattern of β-strand and α-helix found in the (β/α)8-barrel enzymes. The search methods detect α-helices and β-strands in the sequence but the pattern is not able to distinguish the order of the α-helices and β-strands in the barrel domain or to distinguish these secondary structure elements from α-helices and β-strands in other domains of the protein. For example, considering the alignment of AMPS-5 with PYK shown in Figure 6, two of the α-helices, H4 and H5, and two of the β-strands, B5 and B6, are correctly aligned. Helix H1 of the motif is aligned with helix H2 of the PYK barrel domain, helix H3 and strand B4 are aligned with an α-helix and β-strand in the domain inserted between β-strand 3 and α-helix 3 of PYK, and H7 and B8 of the motif are aligned with an α-helix and β-strand in the C-terminal domain of PYK. Similarly with PRAI, which consists solely of a barrel domain, Hl and HI of the motif are correctly aligned but B4, B5, B6 and H5, H6 are misaligned. The situation with PRAI is further complicated, however, by the fact that there is no α-helix between β-strand 5 and β-strand 6 (Wilmanns et al., 1992). Surprisingly, GWEC, a bifunctional enzyme, is higher in the list than either IGPS or PRAI, the sequences of which are contiguous in GWEC. Inspection of the alignment showed that the search pattern is separated over the two domains when scanned against GWEC. The consistent high scoring of several non-barrel sequences is also related to the poor discrimination by the motifs of α-helices and β-strands in the barrel domain. The alignment of AMPS-5 with 3GRS is shown in Figure 7.
Seven p-strands and four a-helices in the motif are aligned with equivalent secondary struct’ure elements in 3GRS. an r/B protein. A further caveat in relation to the AMPS scans is the large range of allowed gaps between the different regions. A flexible gap is necessary as generally there are insertions and deletions within the loop regions of a family of proteins. However, in the case of the (/?/a)g-b arrel enzymes there are large variations in different regions of the structures. For example comparing GOX and TAA, the number of residues between the end of /?-strand 3 and the start of a-helix 3 is 6 and 64 whilst between B-strand 4 and a-helix 4 the number of residues is 59 and 9, respectively. Thus, the large flexible gaps necessary to cope with this sit.,uation do not provide a suffieient restraint, on t,he alignment algorithm.
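The effect of wide gap ranges can be illustrated with a short counting sketch. It is not the AMPS algorithm. It only assumes that each loop in a pattern may take any length between the shortest and longest value observed in the aligned structures, and counts how many combinations of loop lengths the pattern then permits. The function names and the "tight family" ranges are hypothetical; only the GOX and TAA loop lengths are taken from the text.

```python
def required_gap_range(loop_lengths):
    """Smallest (min, max) gap range that accommodates every observed loop length."""
    return min(loop_lengths), max(loop_lengths)


def gap_combinations(gap_ranges):
    """Number of distinct combinations of gap lengths allowed by the ranges."""
    total = 1
    for shortest, longest in gap_ranges:
        total *= longest - shortest + 1
    return total


# Loop lengths (residues) quoted in the text for GOX and TAA.
observed = {
    "strand 3 to helix 3": {"GOX": 6, "TAA": 64},
    "strand 4 to helix 4": {"GOX": 59, "TAA": 9},
}

barrel_ranges = [required_gap_range(lengths.values()) for lengths in observed.values()]

# Hypothetical family whose loops vary by only a few residues (illustrative values).
tight_ranges = [(6, 10), (9, 13)]

print(barrel_ranges)                    # [(6, 64), (9, 59)]
print(gap_combinations(barrel_ranges))  # 3009
print(gap_combinations(tight_ranges))   # 25
```

Under these assumptions, two loops alone allow about a hundred times more gap-length combinations for the barrel pattern than for the hypothetical tight family, which is one way of seeing why such gaps restrain the alignment so weakly.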
7. Conclusion

The limited success of the motifs in discriminating (β/α)8-barrel enzymes from other proteins in the database reveals the limits of the sequence motif method when considering very distantly related proteins. Several factors contribute to this.

(1) The regions of structural similarity between the five aligned structures occur in the β-strand and α-helical regions of the structures. However, sequence patterns within these regions are common to globular α/β proteins in general, not specifically the (β/α)8-barrel enzymes, as shown by the example of 3GRS above.

(2) The sequences of the enzymes are very diverse, and the current methods, employing either a Dayhoff matrix or more specific residue environments, cannot model this large diversity whilst still giving discrimination from other proteins. Furthermore, the observed range in volume of the β-barrels provides additional scope for variation in the sequence.

(3) Following on from (2), it may well prove necessary to subdivide the (β/α)8-barrel enzymes into a few subgroups and generate motifs for each group. The improvement in the alignments with pattern AMPS-3 is pertinent. This motif, composed of TIM, GOX and RUB, significantly improves both the rank of, and the alignments with, two proteins to which these enzymes have been suggested to be related by divergent evolution, 1WSYA and IGPS.

The methods for searching the database used in this work are fast and work for a large number of different protein classes. Nevertheless, in the case of the (β/α)8-barrel enzymes more information (or more specific information) must be included in the searches. An alternative procedure may be the development of a combinatorial approach similar to that developed by Cohen et al. (1983, 1986) for secondary structure prediction, incorporating information from structural analyses such as that of Scheerlinck et al. (1992), where it was found that αβ3 loops in (β/α)8-barrel enzymes involve even-numbered β-strands, there being at least one such loop in the first half of the barrel.

The templates described above are not sufficient in themselves to define categorically a protein as a (β/α)8-barrel.
However, if sequences are scanned against AMPS-5, high-scoring sequences should be considered as possible (β/α)8-barrels. This result could be combined with other information from experiment and theory to derive a tentative model for further assessment. Conversely, and as important, a low-scoring sequence can be excluded as a candidate for (β/α)8-topology. Thus, sequence template methods offer some discrimination, but the problem of identifying common tertiary folds with very low sequence identity remains a challenge.

We thank Dr Y. Lindqvist for the rubisco co-ordinates, Professor G. Schulz for the sequence of cyclodextrin glycosyltransferase, Professor D. Eisenberg for the program ENVIRONMENTS, Dr P. A. Bates for a least-squares fitting program and Dr S. A. Islam for assistance with the graphics program PREPI.

References

Anfinsen, C. B., Haber, E., Sela, M. & White, F. H., Jr (1961). The kinetics of formation of native ribonuclease during oxidation of the reduced polypeptide chain. Proc. Nat. Acad. Sci., U.S.A. 47, 1309-1314.
Banner, D. W., Bloomer, A. C., Petsko, G. A., Phillips, D. C. & Wilson, I. A. (1976). Atomic coordinates for triose phosphate isomerase from chicken muscle. Biochem. Biophys. Res. Commun. 72, 146-155.
Barton, G. J. (1990). Protein multiple sequence alignment and flexible pattern matching. Methods Enzymol. 183, 403-427.
Barton, G. J. & Sternberg, M. J. E. (1987). A strategy for the rapid multiple alignment of protein sequences: confidence levels from tertiary structure comparisons. J. Mol. Biol. 198, 327-337.
Barton, G. J. & Sternberg, M. J. E. (1990). Flexible protein sequence patterns: a sensitive method to detect weak structural similarities. J. Mol. Biol. 212, 384-402.
Bashford, D., Chothia, C. & Lesk, A. M. (1987). Determinants of a protein fold. Unique features of the globin amino acid sequences. J. Mol. Biol. 196, 199-216.
Bernstein, F. C., Koetzle, T. F., Williams, G., Meyer, E. F., Jr, Brice, M. D., Rodgers, J. R., Kennard, O., Shimanouchi, T. & Tasumi, M. (1977). The protein data bank: a computer-based archival file for macromolecular structures. J. Mol. Biol. 112, 535-542.
Blundell, T. L., Barlow, D., Thornton, J. M., Taylor, W. R., Tickle, I. J., Sternberg, M. J. E., Pitts, J. E., Haneef, I. & Hemmings, A. M. (1986). Three-dimensional structural aspects of the design of new protein molecules. Phil. Trans. Roy. Soc. ser. A, 317, 333-344.
Blundell, T. L., Sibanda, B. L., Sternberg, M. J. E. & Thornton, J. M. (1987). Knowledge-based prediction of protein structures and the design of novel molecules. Nature (London), 326, 347-352.
Bowie, J. U., Lüthy, R. & Eisenberg, D. (1991). A method to identify protein sequences that fold into a known three-dimensional structure. Science, 253, 164-170.
Browne, W. J., North, A., Phillips, D. C., Brew, K., Vanaman, T. C. & Hill, R. L. (1969). A possible three-dimensional structure of bovine alpha-lactalbumin based on that of hen's egg-white lysozyme. J. Mol. Biol. 42, 65-86.
Cohen, F. E., Sternberg, M. J. E. & Taylor, W. R. (1982). Analysis and prediction of the packing of alpha-helices against a beta-sheet in the tertiary structure of globular proteins. J. Mol. Biol. 156, 821-862.
Cohen, F. E., Abarbanel, R. M., Kuntz, I. D. & Fletterick, R. J. (1983). Secondary structure assignment for alpha-beta proteins by a combinatorial approach. Biochemistry, 22, 4894-4904.
Cohen, F. E., Abarbanel, R. M., Kuntz, I. D. & Fletterick, R. J. (1986). Turn prediction in proteins using a pattern-matching approach. Biochemistry, 25, 266-275.
Dayhoff, M. O. (1978). Editor of Atlas of Protein Sequence and Structure, vol. 5, suppl. 3, National Biomedical Research Foundation, Washington, DC.
Farber, G. K. & Petsko, G. A. (1990). The evolution of alpha/beta barrel enzymes. Trends Biochem. Sci. 15, 228-234.
Goldman, A., Ollis, D. L. & Steitz, T. A. (1987). Crystal structure of muconate lactonizing enzyme at 3 Å resolution. J. Mol. Biol. 194, 143-153.
Gribskov, M., McLachlan, A. D. & Eisenberg, D. (1987). Profile analysis: detection of distantly related proteins. Proc. Nat. Acad. Sci., U.S.A. 84, 4355-4358.
Henrick, K., Collyer, C. A. & Blow, D. M. (1989). Structures of D-xylose isomerase from Arthrobacter strain B3728 containing the inhibitors xylitol and D-sorbitol at 2.5 Å and 2.3 Å resolution, respectively. J. Mol. Biol. 208, 129-157.
Janin, J. & Chothia, C. (1980). Packing of alpha-helices onto beta-pleated sheets and the anatomy of alpha/beta proteins. J. Mol. Biol. 143, 95-128.
Jorgensen, W. L. & Tirado-Rives, J. (1988). The OPLS potential functions for proteins. Energy minimizations for crystals of cyclic peptides and crambin. J. Amer. Chem. Soc. 110, 1657-1666.
Kruskal, J. B. & Sankoff, D. (1983). In Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison (Sankoff, D. & Kruskal, J. B., eds), pp. 265-310, Addison-Wesley, London.
Lasters, I., Wodak, S. J., Alard, P. & Cutsem, E. (1988). Structural principles of parallel beta-barrels in proteins. Proc. Nat. Acad. Sci., U.S.A. 85, 3338-3342.
Lasters, I., Wodak, S. J. & Pio, F. (1990). The design of idealized alpha/beta-barrels: analysis of beta-sheet closure requirements. Proteins, 7, 249-256.
Lebioda, L., Hatada, M. H., Tulinsky, A. & Mavridis, I. M. (1982). Comparison of the folding of 2-keto-3-deoxy-6-phosphogluconate aldolase, triose phosphate isomerase and pyruvate kinase. J. Mol. Biol. 162, 445-458.
Lesk, A. M. & Chothia, C. (1980). How different amino acid sequences determine similar protein structures: the structure and evolutionary dynamics of the globins. J. Mol. Biol. 136, 225-270.
Lesk, A. M., Branden, C. I. & Chothia, C. (1989). Structural principles of alpha/beta barrel proteins. Proteins, 5, 139-148.
Lim, V. & Ptitsyn, O. (1970). Constancy of the hydrophobic nucleus volume in myoglobin and hemoglobin molecules. Mol. Biol. U.S.S.R. 4, 372-382.
Lindqvist, Y. (1989). Refined structure of spinach glycolate oxidase at 2 Å resolution. J. Mol. Biol. 209, 151-166.
Matsuura, Y., Kusunoki, M., Harada, W. & Kakudo, M. (1984). Structure and possible catalytic residues of taka-amylase A. J. Biochem. 95, 697-702.
Matthews, B. W. & Rossmann, M. G. (1985). Comparison of protein structures. Methods Enzymol. 115, 397-420.
Mavridis, I. M., Hatada, M. H., Tulinsky, A. & Lebioda, L. (1982). Structure of 2-keto-3-deoxy-6-phosphogluconate aldolase at 2.8 Å resolution. J. Mol. Biol. 162, 419-444.
McLachlan, A. D. (1979). Gene duplications in the structural evolution of chymotrypsin. J. Mol. Biol. 128, 49-79.
Needleman, S. B. & Wunsch, C. D. (1970). A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48, 443-453.
Neidhart, D. J., Howell, P. L., Petsko, G. A., Powers, V. M., Li, R., Kenyon, G. L. & Gerlt, J. A. (1991). Mechanism of the reaction catalyzed by mandelate racemase. 2. Crystal structure of mandelate racemase at 2.5 Å resolution: identification of the active site and possible catalytic mechanism. Biochemistry, 30, 9264-9273.
Pearl, L. H. & Taylor, W. R. (1987). Sequence specificity of retroviral proteases. Nature (London), 328, 351-354.
Rao, S. T. & Rossmann, M. G. (1973). Comparison of super-secondary structures in proteins. J. Mol. Biol. 76, 241-256.
Remington, S. J. & Matthews, B. W. (1978). A general method to assess similarity of protein structures, with applications to T4 bacteriophage lysozyme. Proc. Nat. Acad. Sci., U.S.A. 75, 2180-2184.
Remington, S. J. & Matthews, B. W. (1980). A systematic approach to the comparison of protein structures. J. Mol. Biol. 140, 77-99.
Rice, P. A., Goldman, A. & Steitz, T. A. (1990). A helix-turn-strand structural motif common in alpha/beta proteins. Proteins, 8, 334-340.
Rondeau, J.-M., Tete-Favier, F., Podjarny, A., Reymann, J.-M., Barth, P., Biellmann, J.-F. & Moras, D. (1992). Novel NADPH-binding domain revealed by the crystal structure of aldose reductase. Nature (London), 355, 469-472.
Rossmann, M. G. & Argos, P. (1975). A comparison of the heme binding pocket in globins and cytochrome b5. J. Biol. Chem. 250, 7525-7532.
Rossmann, M. G. & Argos, P. (1976). Exploring structural homology of proteins. J. Mol. Biol. 105, 75-95.
Rossmann, M. G. & Argos, P. (1977). The taxonomy of protein structure. J. Mol. Biol. 109, 99-129.
Rouvinen, J., Bergfors, T., Teeri, T., Knowles, J. K. C. & Jones, T. A. (1990). Three-dimensional structure of cellobiohydrolase II from Trichoderma reesei. Science, 249, 380-386.
Sali, A. & Blundell, T. L. (1990). Definition of general topological equivalences in protein structures. J. Mol. Biol. 212, 403-428.
Scheerlinck, J.-P. Y., Lasters, I., Claessens, M., De Maeyer, M., Pio, F., Delhaise, P. & Wodak, S. J. (1992). Recurrent alpha-beta loop structures in TIM barrel motifs show a distinct pattern of conserved structural features. Proteins, 12, 299-313.
Schneider, G., Lindqvist, Y. & Lundqvist, T. (1990). Crystallographic refinement and structure of ribulose-1,5-bisphosphate carboxylase from Rhodospirillum rubrum at 1.7 Å resolution. J. Mol. Biol. 211, 989-1008.
Stec, B. & Lebioda, L. (1990). Refined structure of yeast apo-enolase at 2.25 Å resolution. J. Mol. Biol. 211, 235-248.
Stuart, D. I., Levine, M., Muirhead, H. & Stammers, D. K. (1979). Crystal structure of cat muscle pyruvate kinase at a resolution of 2.6 Å. J. Mol. Biol. 134, 109-142.
Taylor, W. R. (1986a). The classification of amino acid conservation. J. Theoret. Biol. 119, 205-218.
Taylor, W. R. (1986b). Identification of protein sequence homology by consensus template alignment. J. Mol. Biol. 188, 233-258.
Taylor, W. R. (1988). Pattern matching methods in protein sequence comparison and structure prediction. Protein Eng. 2, 77-86.
Taylor, W. R. & Orengo, C. A. (1989a). A holistic approach to protein structure alignment. Protein Eng. 2, 505-519.
Taylor, W. R. & Orengo, C. A. (1989b). Protein structure alignment. J. Mol. Biol. 208, 1-22.
Thornton, J. M., Flores, T. P., Jones, D. T. & Swindells, M. B. (1991). Prediction of progress at last. Nature (London), 354, 105-106.
Wilmanns, M., Hyde, C. C., Davies, D. R., Kirschner, K. & Jansonius, J. N. (1991). Structural conservation in parallel β/α-barrel enzymes that catalyze three sequential reactions in the pathway of tryptophan biosynthesis. Biochemistry, 30, 9161-9169.
Wilmanns, M., Priestle, J. P., Niermann, T. & Jansonius, J. N. (1992). Three-dimensional structure of the bifunctional enzyme phosphoribosylanthranilate isomerase: indoleglycerolphosphate synthase from Escherichia coli refined at 2.0 Å resolution. J. Mol. Biol. 223, 477-507.
Wilson, D. K., Rudolph, F. B. & Quiocho, F. A. (1991). Atomic structure of adenosine deaminase complexed with a transition-state analog: understanding catalysis and immunodeficiency mutations. Science, 252, 1278-1284.
Xia, Z. X. & Mathews, F. S. (1990). Molecular structure of flavocytochrome b2 at 2.4 Å resolution. J. Mol. Biol. 212, 837-863.
Edited by A. R. Fersht