Biochimica et Biophysica Acta 1208 (1994) 247-255
Structural analysis and classification of lipocalins and related proteins using a profile-search method
Clare E. Sansom a,*, Anthony C.T. North a, Lindsay Sawyer b
a Department of Biochemistry and Molecular Biology, University of Leeds, Leeds LS2 9JT, West Yorkshire, UK
b Department of Biochemistry, Hugh Robson Building, University of Edinburgh, Edinburgh EH8 9XD, UK
Received 14 February 1994
Abstract
A three-dimensional profile method of detecting amino-acid sequences compatible with the tertiary structure of any protein has been applied to the lipocalin family of 8-stranded beta-barrels. Profiles derived from six well-resolved lipocalin crystal structures were used to search a comprehensive, non-redundant protein sequence database. Each profile identified a sub-group of lipocalin sequences although no single profile was sufficient to identify the whole family. The α-1-acid glycoprotein sub-family was not identified by any lipocalin profile, indicating that known sequence differences in otherwise well conserved regions of these proteins may be reflected in structural differences. The predicted similarity between the β-lactoglobulin and α-2u-globulin structures was much more marked than the similarity between their sequences, and α-1-microglobulin sequences were found to be compatible with the structure of epididymal retinoic acid binding protein, which has an additional long C-terminal helix. Proteins of unknown structure which were predicted to be compatible with the lipocalin fold include a human mucin. In cases where a large protein family of low overall sequence similarity contains a small number of known structures, this technique can be useful in determining or confirming subtle structural relationships between family members.
Keywords: Lipocalin; Calycin; Inverse folding; Protein sequence; Protein structure
1. Introduction
The lipocalins are a diverse and widely distributed family of small, predominantly extracellular transport proteins which show high binding affinity and selectivity for small hydrophobic molecules. Crystal structure determination of several lipocalins has revealed a common fold: a central 8-stranded beta-barrel with repeated + 1 topology [1], with an N-terminal helical turn and at least one long C-terminal helix. The structure of a typical lipocalin, mouse urinary protein [2] is shown schematically in Fig. 1. However, although the lipocalins are characterised by three
Abbreviations: RBP, retinol binding protein; BBP, bilin binding protein; ICYAN, insecticyanin; A2U, α-2u-globulin; MUP, major mouse urinary protein; e-RABP, epididymal retinoic acid binding protein; FABP, fatty acid binding protein; 3-D, three-dimensional; 1-D, one-dimensional; PDB, Protein Data Bank; EMBL, European Molecular Biology Laboratory.
* Corresponding author. E-mail: [email protected]. Fax: +44 532 333167.
0167-4838/94/$07.00 © 1994 Elsevier Science B.V. All rights reserved
SSDI 0167-4838(94)00116-X
well-defined and highly conserved sequence motifs [3] they do not show high overall sequence homology: the pairwise identity between two randomly chosen lipocalin sequences is typically only about 20% [4]. This is below the 25% level where structural similarities frequently go undetected [5]. A second family of small proteins which bind hydrophobic molecules, the fatty acid binding proteins (FABPs), has recently been identified [6]. These often bind the same ligands as the lipocalins but are almost exclusively intracellular. They show some limited sequence homology with the lipocalins, particularly towards the N-terminus. Crystallographic analysis has shown that these proteins also share a similar fold: a 10 stranded beta-barrel with repeated + 1 topology [7]. In view of the similarities of sequence, structure and function between these two protein families, and the 'cup-shape' common to the structures, they have been joined to form a 'super-family' and termed the calycins [3]. High-resolution structures of lipocalins show clearly that whilst the common core of the beta barrel is well
defined, there are significant differences in the length and conformation of the loop regions. For example, rat epididymal retinoic acid binding protein (e-RABP) [8] is the only known lipocalin structure to have two long, well-defined helices at the C-terminal end of the beta-barrel, a single C-terminal helix being observed in all previously determined lipocalin structures. The structures of these surface loops must determine the exact ligand-binding specificity of the proteins as well as, in some cases, their binding affinity for cell surface receptors.
In a fairly large and rapidly burgeoning protein family, with a small number of characteristic structures known and relatively low overall sequence similarity between individual members, techniques of 'inverse protein folding' (essentially searching a database for protein sequences compatible with a given structure) can be used to compare and classify the members of the family. Possible subtle structural relationships between members of unknown structure can be determined and completely unrelated sequences which might share the same fold may be recognised. Measures of structural similarity which are independent of the exact protein sequence, such as 3-D profile analysis, may be a useful guide to functional and evolutionary relationships.
We have used the 3-D profile method developed by Eisenberg and co-workers [9-12] to search a comprehensive protein sequence database with profiles derived from each well characterised lipocalin structure in turn. In this method each residue position is assigned an 'environment class' based upon the hydrophobicity, polarity and secondary structure of its environment, reducing the 3-D structure to a 1-D string which is independent of, whilst being derived from, the amino-acid sequence. A table giving the relative probabilities of finding each of the 20 amino acids at each position in the 1-D string (the profile) is then constructed. The compatibility of any sequence with a given profile is determined by summing the compatibility scores at each position, after the total score has been optimised by aligning the sequence with the structure. With very few exceptions, high-resolution crystal structures of globular proteins give high total scores with their own sequences, indicating complementarity between sequence and structure [11]. Other high scoring sequences, even of low sequence similarity, are likely to exhibit the same fold as the probe structure. We validated the method by searching for any high-scoring sequences of known but unrelated fold, and by repeating the analysis using both a typical FABP structure and an unrelated protein of similar size.
Fig. 1. Ribbon drawing of a typical lipocalin, mouse urinary protein (see [2]). The hydrophobic ligand 2-(sec-butyl)thiazoline is bound in the interior of the beta-barrel. Molecular graphics using MOLSCRIPT (Kraulis, P.J. (1991) J. Appl. Crystallogr. 24, 946-950).
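The scoring step of the profile method (summing per-position compatibility scores after an optimising alignment) can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: the toy profile values, the function name `profile_score`, the default score for residues missing from a row, and the single linear gap penalty. The published method derives its scores from 3-D environment classes and uses separate gap opening and gap length penalties, so this is not the authors' implementation.

```python
from typing import Dict, List

# Hypothetical toy profile: one row of amino-acid scores per structural position.
# Real 3-D profiles carry scores for all 20 amino acids at every position.
Profile = List[Dict[str, float]]

TOY_PROFILE: Profile = [
    {"L": 2.0, "K": -1.0},   # position 1: buried, favours leucine
    {"K": 1.5, "L": -0.5},   # position 2: exposed, favours lysine
    {"V": 1.0},              # position 3: favours valine
]


def profile_score(profile: Profile, sequence: str,
                  gap: float = 1.0, default: float = -1.0) -> float:
    """Best global alignment score of `sequence` against `profile`.

    Simplification (assumption): one linear gap penalty per skipped
    position or residue, instead of affine opening/extension penalties.
    """
    n, m = len(profile), len(sequence)
    # dp[i][j] = best score aligning the first i profile positions
    # with the first j residues of the sequence.
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0] - gap
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] - gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            residue = sequence[j - 1]
            match = dp[i - 1][j - 1] + profile[i - 1].get(residue, default)
            skip_position = dp[i - 1][j] - gap   # profile position left unmatched
            skip_residue = dp[i][j - 1] - gap    # residue left unmatched
            dp[i][j] = max(match, skip_position, skip_residue)
    return dp[n][m]
```

In this toy case a sequence that places the favoured residue at every position scores highest, and a sequence lacking one residue pays a single gap penalty rather than forcing a poor match.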
2. Materials and methods
2.1. Protein sequences and coordinates
Six well-resolved lipocalin crystal structures were used for this study. Coordinates of RBP and BBP were obtained from the Protein Data Bank (PDB) ([13]; data entry codes 1RBP [14] and 1BBP [15]). Coordinates of insecticyanin (ICYAN) [16], α-2u-globulin (A2U) and mouse urinary protein (MUP) [2], and epididymal retinoic acid binding protein (e-RABP) [8] were obtained from the authors prior to their public release. Coordinates of intestinal fatty acid binding protein (a member of the FABP family) and flavodoxin, which were used for validation, were also obtained from the PDB. All protein sequences were extracted from the OWL composite sequence database [17]. The coordinates of one subunit of apo e-RABP were made available during the course of this study (M. Newcomer, private communication). One 9-residue surface loop of this protein was only available as an alpha-carbon trace; full coordinates for this loop were generated using O v.5.8 [18] and the whole structure assessed using the profile method [10] before it was used to search the database. There was no noticeable difference in the profile score for this loop region compared to the rest of the protein.
2.2. Methodology
A single monomer of each lipocalin structure was used to generate a 3-D profile [9]. Where secondary structure records were not available from the PDB files, they were generated using the algorithm of Kabsch and Sander [19] and confirmed by inspection. The profiles were used to scan version 19 of the OWL composite sequence database [17] (with over 45000 sequences) for sequences which were compatible with each fold. Results of these scans are given in terms of 'Z-scores'; the Z-score of any sequence against a given profile is defined as the number of standard deviations of its total score above the mean score of all sequences of a similar length. It should be independent of the size of the database. Z-scores for all sequences that had been defined as lipocalins in either PROSITE v.10.2 [20] or PRINTS, the features database from SERPENT, version 3 [21], were extracted for further analysis along with all non-lipocalin sequences which the profile scans suggested could have a similar fold (see below). Further analysis of key sequences was performed by using SWEEP [21] for database scanning; the GCG program BESTFIT [22] for pairwise similarity evaluation; and the algorithm of Garnier, Osguthorpe and Robson [23] for secondary structure prediction. Individual sequences were scanned for characteristic features from the PRINTS database by using the program FEAT (D.N. Perkins, unpublished). Molecules were examined on a Silicon Graphics IRIS workstation using version 2.1 of the INSIGHT II molecular modelling package [24].
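The Z-score definition above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the scanning software used in the study: the function name `z_score`, the representation of the database as (length, score) pairs, and the 10% length window used to decide which sequences count as being 'of a similar length' are all hypothetical choices, since the text does not specify how that window was set.

```python
from statistics import mean, stdev
from typing import List, Tuple


def z_score(score: float, length: int,
            reference: List[Tuple[int, float]],
            tolerance: float = 0.1) -> float:
    """Number of standard deviations of `score` above the mean score
    of reference sequences of similar length.

    `reference` holds (sequence length, total profile score) pairs.
    `tolerance` is an assumed fractional length window (10% by default).
    """
    similar = [s for (n, s) in reference
               if abs(n - length) <= tolerance * length]
    if len(similar) < 2:
        raise ValueError("need at least two reference sequences of similar length")
    sigma = stdev(similar)
    if sigma == 0:
        raise ValueError("reference scores have zero spread")
    return (score - mean(similar)) / sigma
```

Because the mean and standard deviation are taken only over sequences of comparable length, the result does not grow simply because longer sequences accumulate larger total scores.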
3. Results and discussion
The pattern of high-scoring sequences extracted by each lipocalin profile followed a similar trend, which is illustrated in Fig. 2 with two typical examples (retinol binding protein and MUP). In each case, whilst all of the sequences with the highest Z-scores are members of the lipocalin family, a significant percentage of lipocalin sequences does not match any given profile. This is similar to results reported for the beta-alpha barrel fold [12] but contrasts with the more clear-cut results reported earlier for the globin family [9]. Taken together, these results indicate that this method, used with a single known structure, cannot always identify all members of a protein family. The large majority of sequences with Z-scores over approx. 7 with the profile from any of the six lipocalin structures are known lipocalins; the high proportion of lipocalin sequences with Z-scores between 5 and 7 with any lipocalin structure indicates that there is a moderate probability that these sequences will also have the same fold. Proteins scoring over 7 with any one of the lipocalin structures must have a high probability of having the same fold, even if they have no sequence similarity or apparent functional similarity to the lipocalins. In agreement with Wilmanns and Eisenberg [12] we therefore regard Z-scores of 7.0 or greater as highly significant and those between 5.0 and 7.0 as moderately significant. Typically, 30-35% of sequences classified as lipocalins, and 85% of all sequences, gave Z-scores < 1.0 with each individual profile. These scores give no evidence of compatibility between sequence and structure, and individual scores in this range have not been included in the analysis below. However, almost all lipocalin sequences gave significant scores against one or more of the six profiles. We tested the likelihood of sequences matching a random unrelated structure by chance, by constructing profiles
Fig. 2. The results of a search of the OWL sequence database with 3-D profiles derived from two representative lipocalin structures, (A) RBP and (B) MUP. The numbers of lipocalin and non-lipocalin sequences extracted at or above given Z-scores are shown; sequences with a high Z-score against a structure are likely to have the same fold. [Histograms: total number of sequences and number of lipocalin sequences, plotted against Z-score bins from <1 to >30.]
from two non-lipocalin structures and extracting any matching lipocalin sequences. One of these was a related FABP structure, rat intestinal fatty acid binding protein (PDB code 2IFB; [25]), and the other was flavodoxin from Chondrus crispus (PDB code 2FCR; [26]). Flavodoxin, an α/β protein, was chosen as it is a similarly compact globular protein of a similar size to the lipocalins, neither highly glycosylated nor complexed to other macromolecules and with a high resolution structure available with no missing or ambiguous residues. No lipocalin sequence gave a Z-score of over 5.0 with either of these controls; the highest scores for each were Z = 3.69 against 2IFB (out of 16 lipocalins giving scores over 1.0) and Z = 2.30 against 2FCR (out of 13 giving scores over 1.0). Conversely, whilst several immunoglobulin domains which are known to resemble superficially the lipocalin fold gave Z-scores of over 5.0 with at least one lipocalin structure, many other proteins known to contain 8-stranded beta-barrels (such as superoxide dismutase and rubisco) did not score significantly against any lipocalin.
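The significance bands adopted in this section (Z of 7.0 or greater as highly significant, 5.0 to 7.0 as moderately significant, and scores below 1.0 as giving no evidence of compatibility) can be written as a small Python helper. The function name `classify_z` and the band labels are hypothetical; in particular, the label "low" for scores between 1.0 and 5.0 is an assumption, since the text does not name that range.

```python
def classify_z(z: float) -> str:
    """Map a Z-score onto the significance bands used in the text.

    Labels are illustrative; "low" (1.0 <= Z < 5.0) is an assumed name
    for the range the paper reports but does not treat as significant.
    """
    if z >= 7.0:
        return "high"       # high probability of the same fold
    if z >= 5.0:
        return "moderate"   # moderate probability of the same fold
    if z >= 1.0:
        return "low"        # reported, but not taken as significant
    return "none"           # no evidence of compatibility


# The two control results quoted above both fall below the 5.0 threshold.
controls = {"2IFB": 3.69, "2FCR": 2.30}
control_bands = {code: classify_z(z) for code, z in controls.items()}
```

Applied to the control profiles, the best lipocalin score against either 2IFB or 2FCR lands in the "low" band, which is the sense in which neither control extracted a lipocalin at a significant level.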
A cross-correlation matrix of Z-scores for the six lipocalin structures used against their respective sequences is given in Table 1. As expected, each structure scores highest against its own sequence. A 'theoretical own-sequence score' for a protein of a given length can be calculated by using an equation obtained by least-squares fitting to a log-log plot of profile scores of all available well resolved X-ray structures of proteins against sequence length. All coordinate sets in the January 1991 release of the PDB where the actual score of the structure against its own sequence was less than 45% of this theoretical score were either model structures or known to be seriously mis-folded [11]. Here, the 'own-sequence Z-scores' reported are consistently high in spite of the fact that all the profiles were derived from isolated monomers of proteins known to exist in a variety of oligomeric states. In all cases where the profile score was found to depend significantly on oligomeric state the monomer is less than 100 amino acids in length [10]. There was no entry in the OWL database for mature e-RABP, but the precursor sequence, ERBP_RAT (with a 22-residue signal peptide), was extracted from the database with a Z-score not significantly different from those of the other 'correct' sequences. As expected, there is a very strong correlation between the structures of MUP and A2U, and a significant, but less strong, correlation between those of BBP and insecticyanin. No structure has a Z-score above 5.0 with the sequences from more than two other lipocalin structures, and the sequence of human retinol binding protein does not seem to be compatible with any other lipocalin structure.
3.1. Sub-classification of lipocalins according to structural patterns
All sequences defined as lipocalins in the Leeds features database (PRINTS) or in PROSITE, including those
listed in PROSITE as 'false negatives', have been classified according to their Z-scores against the six lipocalin structures (Table 2). High scores against a particular structure may indicate a more subtle structural relationship with that protein which may not be clear from sequence similarity alone. There seem to be two distinct 'sub-families', one showing complementarity to the structure of e-RABP and the other to those of A2U and MUP, whilst few other lipocalin sequences give high scores against RBP, BBP or insecticyanin. Purpurin, the only other lipocalin to score highly against RBP, is 53% identical and 71% similar to human serum RBP. In Fig. 3 some lipocalins are grouped according to 'sub-family' around a phylogenetic tree derived from sequence similarity [27]. Whilst in most cases sequence similarity implies structural similarity, there are one or two anomalies. The γ-chain of complement protein C8 is nearer in sequence to A2U but seems more compatible with the structure of BBP, and neither the α-1-acid glycoproteins nor, interestingly, the C1 chain of crustacyanin would be predicted to be members of this family from the structural analysis alone. Crustacyanin C1 and all other lipocalins with sequences compatible with the structures of BBP and/or insecticyanin share the distinctive disulfide bonding pattern of these two proteins [27]. The group of proteins which correlates highly with the e-RABP structure includes prostaglandin D synthases and the α-1-microglobulins. Neither a functionally similar, androgen-dependent protein recently identified in the lizard ([28]; EMBL code S18880) nor any other lipocalins known to bind retinoids gave significant Z-scores with this profile. It is interesting to speculate whether the α-1-microglobulins (which are co-expressed with trypsin inhibitors) also possess the second C-terminal helix which characterises the e-RABP structure.
Two new sequences which fitted the lipocalin profile but had not yet been added to the PRINTS database were also highlighted by the e-RABP profile. Both of these were DNA sequences of a common
Table 1
Cross-correlation matrix showing Z-scores for 3-D compatibility of the sequences of six lipocalins of known structure, against the respective structures

Sequence          Structure
                  BBP      RBP      A2U      MUP      ICYAN    e-RABP
BBP_PIEBR         46.20     1.70                      15.30     3.30
RETB_HUMAN         1.04    37.44
MUP_RAT a          3.79             37.94    30.88     2.09    10.08
ICYA_MANSE        15.78     3.67                      38.18     2.66
MUP6_MOUSE a       3.10             30.52    38.75     1.11     2.46
ERBP_RAT b         2.36     1.14     2.24     5.46     2.05    36.03

Sequence names are those in the OWL database. A blank in the table occurs wherever a sequence did not give a Z > 1.0 score with a profile. Scores higher than 7 indicate a high probability of the same fold; scores higher than 5 indicate a moderate probability of the same fold. Gap opening and gap length penalties were, respectively, 5.0 and 0.5.
a The sequences of the variants of MUP and A2U with the highest percentage sequence similarity to those used for the structure determinations are shown. MUP_RAT is the OWL code for a rat α-2u-globulin.
b ERBP_RAT (Newcomer, M.E. and Ong, D.E. (1990) J. Biol. Chem. 265, 12876-12879) is a precursor of rat epididymal retinoic acid binding protein with a 22-residue signal peptide. The mature protein sequence was not present in the OWL database.
Table 2
Sorted table giving Z-scores of all lipocalin sequences extracted by the sequence motifs defined in PRINTS and/or in PROSITE (including sequences recorded as false negatives in PROSITE), against the six well-characterised lipocalin structures, and the mean Z-score for each sequence
Structure
ERABP
Sequences with closest match to the e-RABP structure
Androgen-dependent pr (rat) 36.03
α-1-Microglobulin (pig) 12.29
α-1-Microglobulin (rat) 11.47
Lipocalin precursor (toad) 11.37
α-1-Microglobulin (human) 10.96
Von Ebner's gland pr (human) a 10.00
PG-D synthase (human) 9.80
α-1-microglobulin prec (fish) 9.61
PG-D synthase 2 (human) 9.40
Olfactory protein prec. (frog) a 8.16
Crustacyanin A2 subunit 7.79
Sequences with closest match to the human RBP structure
Plasma RBP (bovine)
Plasma RBP (rat)
Plasma RBP (human) (from PDB)
Plasma RBP (human)
Plasma RBP (rabbit)
Plasma RBP (pig)
Plasma RBP (frog)
Purpurin (chicken) 1.07
Plasma RBP II (trout)
Plasma RBP I (trout)
Sequences with closest match to the BBP structure
BBP (white butterfly) 3.30
BBP (cabbage butterfly) (PDB) 3.30
C8 complement (human) 2.78
Sequences with closest match to the ICYAN structure
A-Insecticyanin (hornworm) 2.66
B-Insecticyanin (hornworm)
Apolipoprotein D (human: PDB) 2.97
Apolipoprotein D (human) a 2.53
Apo D precursor (human) 2.83
Sequences with closest match to the A2U structure
α-2u-Globulin (rat) 10.08
α-2u Precursor I (rat) 12.12
α-2u (rat: S type) 10.14
α-2u Prec. clone 43 (rat) 10.75
α-2u (rat: L type) 10.24
α-2u Prec. clone 36 (rat) 10.54
Minor mouse urinary protein 6.83
CH21 marker protein (chicken) 8.60
Quiescence specific (chicken) 8.67
Probasin precursor (rat) a 1.38
NGAL lipocalin prec. (mouse) 1.38
β-Lactoglobulin I (horse) 2.39
β-Lactoglobulin I (donkey) 2.09
NGAL lipocalin prec. (rat) 5.73
PG-D synthase (rat) 7.38
BLG precursor (bovine) 3.78
β-Lactoglobulin B (buffalo) 3.92
β-Lactoglobulin B (bovine) 3.78
BLG precursor (sheep) 3.37
BLG precursor (bovine) 3.78
BLG precursor (goat) 3.54
PG-H2 isomerase prec. (rat) 4.78
BLG I variant (pig)
BLG I fragments (dolphin) 1.49
L epididymal pr prec. (lizard) 2.03
Placental protein 14 (human) 2.83
BLG II fragment (horse) a 4.91
RBP
BBP
ICYAN
A2U
MUP
Mean 8.21 4.97 4.32 1.90 5.14 3.52 5.65 5.04 4.95 4.65 3.76
1.14 1.63 2.07
2.36 6.50 6.72
2.05 1.85 1.14
2.24 3.75 3.42
5.46 3.78 1.10
1.95 4.06 1.01 5.16
8.41
3.85
3.88 7.16 3.17 1.94 4.24
2.36 2.66 2.45 1.58 7.21
2.67 3.74 9.57 3.86 7.76 4.57
2.98 3.31 7.28 1.79 6.90 4.52
7.14 3.36 37.61 37.48 37.44 36.53 36.23 36.15 34.12 28.74 15.66 14.96
1.11 1.13
1.70 1.70 1.03
46.20 46.20 7.89
15.30 15.30 1.20
3.67 3.56 2.29 1.47 2.15
15.78 15.14 4.04 2.14 3.88
38.18 37.24 8.04 7.84 7.75
3.79 3.29 2.81 2.75 2.81 2.73 5.43
7.75 7.62
1.68
2.09 0.75 2.81 2.61 2.81 1.80 4.94 4.87 4.93
2.27
7.33
2.67
1.52 2.25 3.30 3.38 3.25 3.18 3.38 2.87 1.19 1.23
5.06 2.15 2.12
2.49
1.66
0.73
1.97
6.27 6.43 6.41 6.09 6.04 6.21 5.69 5.16 2.61 2.49
1.11 1.04
2.46
4.32 4.10 3.92 4.21 3.99 3.70
2.56 1.85 1.71 3.45 1.89
3.73
1.02
11.08 11.08 2.94 10.05 9.32 2.89 2.33 2.94
37.94 37.83 36.90 36.89 36.37 35.38 27.83 15.83 14.39 14.06 12.26 10.73 10.47 9.98 9.37 9.36 8.73 8.48 8.48 8.36 8.07 7.28 6.32 5.89 5.62 5.29 5.14
30.88 33.15 30.19 29.47 29.63 28.33 26.99 14.46 12.68 11.05 9.32 10.37 10.10 8.04 5.26 8.29 7.11 6.90 6.86 6.65 6.49 3.58 3.58 5.89 2.64 5.08
14.00 14.80 13.81 13.75 13.64 13.13 12.00 8.59 8.33 4.42 5.87 3.92 3.78 5.47 4.40 5.20 4.54 4.39 4.35 4.77 4.11 3.23 2.16 2.50 2.62 1.35 2.84
Table 2 (continued) Structure
ERABP
RBP
BBP
ICYAN
A2U
MUP
Mean
2.01 1.97 1.22 2.53 1.98 4.61 3.69
1.58 1.39 1.11 1.96 1.43
1.34 1.02
2.20 2.23
31.17 31.14 30.52 29.90 30.63 32.40 31.96 15.79 8.05 7.55 9.75 7.59 9.08 7.12
39.05 39.02 38.75 38.45 38.15 34.83 34.78 19.03 11.91 11.12 10.04 9.83 9.27 7.63
12.82 12.77 12.34 12.54 12.44 13.13 13.37 6.29 5.04 4.83 4.49 3.69 4.02 3.32
2.05
1.89
2.33
4.42
3.56 2.56 2.34 1.53 2.06 1.65 1.22
2.49 2.89 2.08 1.11 1.55
2.37 1.70 2.60 0.92 1.65 0.91 1.54 1.07 0.60 0.28 0.20
Sequences with closest match to the MUP structure
MUP 2 precursor (mouse)
MUP 1 precursor (mouse)
MUP 6 precursor (mouse)
MUP 8 precursor (mouse)
MUP mRNA (mouse)
MUP 4 precursor (mouse)
MUP 5 precursor (mouse)
Aphrodisin (hamster) a
β-Lactoglobulin II (horse)
β-Lactoglobulin II (donkey)
Odorant binding pr prec. (rat) a
β-Lactoglobulin II (cat)
Odorant binding pr (bovine) a
Von Ebner's gland pr (rat) a
3.10 3.07 2.46 2.40 2.47 6.91 8.60 1.54 4.91 5.12 5.76 3.07 5.76
1.39 1.82 1.92 1.40
1.19
1.67 2.52
2.63
1.15 3.58 1.46 1.10
4.97 1.96 1.37 1.05 2.76
No match at Z >= 5.0 to any structure
Apolipoprotein D prec. (rat)
α-1-acid GP1 prec. (human) a
Odorant binding pr prec. (rat) a
α-1-acid GP2 prec. (human) a
BLG fragments (pig)
BLG fragments (pig) a
β-Lactoglobulin (kangaroo) a
Crustacyanin C1 s/unit a
α-1-acid GP8 prec. (wild mouse) a
α-1-acid GP1 prec. (wild mouse) a
α-1-acid GP1 prec. (mouse) a
1.42 4.66 4.22 3.39
2.72 1.79 1.08
1.09
1.47 1.62
2.23 2.17
a Sequences not matching all 3 motifs of the lipocalin signature in PRINTS at a statistically significant level. Z-scores below 1.0 are not shown.
precursor of α-1-microglobulin (the N-terminal half of the sequence) and a trypsin inhibitor also known as bikunin [29]: the mature proteins, which are both serum glycoproteins, are produced by proteolytic cleavage. This confirms the impression from the high score of the precursor sequence ERBP_RAT with the mature e-RABP structure, that unless a precursor is itself stable in a different fold from the mature protein, the presence of (for instance) a signal peptide is unlikely to affect the discrimination of an Eisenberg scan. More than half the lipocalin sequences form a large group which give significant Z-scores against the MUP and A2U structures only. This includes, besides the expected numbers of MUP and A2U variants, other odour-binding proteins and proteins involved in the reproductive system, and the β-lactoglobulin family. The β-lactoglobulins (BLGs) are the major whey protein of domestic mammals; they have been recently reviewed by Hambling et al. [30]. Whilst β-lactoglobulin sequences are slightly more similar to MUP and A2U than to the other lipocalins of known structure (from percentage scores from BESTFIT), these differences are not statistically significant. In many non-ruminant species (see, e.g., [31]) BLG is encoded by more than one gene; the two main variants, BLG I and II, are distinguished by the Eisenberg scans with BLG Is seeming to fit A2U more closely and BLG IIs giving higher scores against MUP. Ruminant BLGs show genetic polymorphism at a single locus and, unlike BLGs from
other species, form dimers at physiological pH [30]: the BLG B variants from several ruminant species fit the structure of A2U most closely, forming a cluster in the table. A single form of BLG isolated from the milk of the Eastern Grey Kangaroo [32] does not seem to fit this pattern at all. Apolipoprotein D is one of a very few sequences which do not match any of the six lipocalin profiles, but which give an unambiguous match to all three of the lipocalin motifs in PRINTS [21]. The largest group of sequences which are classified as lipocalins but which do not fit any of the lipocalin profiles is the α-1-acid glycoprotein subfamily (see, e.g. [33,34]): these proteins are heavily glycosylated. As most of these do not match fully the three motifs defining the LIPOCALIN feature, and some are also recorded as 'false negatives' in PROSITE, it seems likely that these proteins may have some significant structural differences from the rest of the lipocalin family. In order to test the usefulness of a combined Z-score as a discriminatory tool, the mean scores of all lipocalin sequences over all six profiles were then calculated (Table 2). All scores below 1.0 were taken to be 0.0: as the distribution of scores is likely to extend below -1.0 [9] this is a reasonable approximation. The mean of these mean scores for all lipocalin sequences was 6.0 (s.d. 4.1). The lipocalins were then divided into three classes: those with an equivalent structure in the test set (where the means are skewed by very high 'own-sequence scores')
(class A); those other lipocalins matching all 3 motifs in PRINTS (class B); and fragments or incomplete matches (class C). The means of the mean scores for the three classes were, respectively, 10.1 (s.d. 3.7) for class A; 4.3 (s.d. 1.6) for class B and 2.5 (s.d. 1.8) for class C. Mean scores against all profiles were also determined for 30 sequences matching the related FABP profile; the mean of these means was 0.3 (s.d. 0.2). These values are likely to be higher than would be expected from a sample of non-lipocalin sequences taken at random from the OWL database. The difference between the mean scores of each 'adjacent' pair of classes was highly statistically significant (P < 0.005 in each case). Thus these 'combined scores' are more useful than the individual profiles in determining which sequences will take up the lipocalin fold.
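The combined score described here (the mean over all six profiles, with every score below 1.0 taken as 0.0) can be sketched in Python as follows. The function name `combined_score` and the use of `None` for an unreported score are assumptions made for illustration. The example input is the set of six Z-scores that the source reports for the e-RABP precursor ERBP_RAT, whose mean is listed in Table 2 as 8.21.

```python
from typing import Iterable, Optional


def combined_score(z_scores: Iterable[Optional[float]],
                   floor: float = 1.0) -> float:
    """Mean Z-score over all profiles, treating any score below
    `floor` (or missing, given as None) as 0.0."""
    values = [z if (z is not None and z >= floor) else 0.0
              for z in z_scores]
    if not values:
        raise ValueError("at least one profile score is required")
    return sum(values) / len(values)


# ERBP_RAT against the e-RABP, RBP, BBP, ICYAN, A2U and MUP profiles.
erbp_rat = [36.03, 1.14, 2.36, 2.05, 2.24, 5.46]
erbp_rat_mean = round(combined_score(erbp_rat), 2)
```

Flooring the low scores at zero keeps sequences with several uninformative profiles from being ranked by noise, at the cost of slightly overstating their true mean.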
Fig. 3. Qualitative comparison of the structural 'sub-families' of lipocalins derived from the profile searches with a phylogenetic tree of the lipocalins. The central diagram is a phylogenetic tree of 25 lipocalins derived from sequence similarity, and is redrawn with permission from [27]. The scale bar represents the branch length corresponding to 0.1 amino-acid substitution per site. Lipocalins with high resolution 3-D structures used in the searches are boxed. These sequences are grouped into 'families' according to which of the six structures they were found to be most compatible with, and placed round the outside of the tree accordingly. The sequences in square brackets are those which do not match any of the profiles with a Z-score over 5.0. Abbreviations for Fig. 3 were taken from [27]. Lipocalin sequences with 3-D structures used in this study: RBP, retinol binding protein; MBBP, insecticyanin; PBBP, bilin-binding protein; 18.5k, e-RABP (rat); and α2UG, α-2u-globulin. Other sequences: PURN, purpurin; ALPD, apolipoprotein D (human); CNC1, crustacyanin C1; CNA2, crustacyanin A2; α1GP, α-1-acid glycoprotein; 2 PZBP, odour-binding protein (pyrazine-binding); PRB, probasin; APHR, aphrodisin (hamster); OBP, odorant-binding protein (rat); PP14, placental protein 14 (human); βLG, β-lactoglobulin; P20K, quiescence specific polypeptide 20K; HPDS, brain PGD synthase (human); RPDS, brain PGD synthase (rat); 24p3, neutrophil gelatinase-associated lipocalin precursor (24p3 protein); C8γ, complement component C8 γ chain (human); α1MG, α-1-microglobulin; OBP II, odor-binding protein II (rat); VEGP, Von Ebner's gland protein; and BGP, Bowman's gland protein (frog).
3.2. Identification of unrelated sequences with similar folds
All non-lipocalin sequences which give a Z-score of 7.0 or over, indicating a high probability of a similar fold, are listed in Table 3. These proteins are all of unknown 3-D structure. No non-lipocalin sequence scored in this range with more than one lipocalin structure. The mean scores (against all 6 profiles) of the 15 sequences in this table range from 1.66 to 3.68, with a mean value of 2.7 (s.d. 0.7). This distribution cannot be distinguished statistically from that of the set of lipocalins defined earlier as 'class C'. However, these sequences match the lipocalin profiles much more closely than the FABP sequences do (P < 0.0001). Several proteins of known 3-D structure give Z-scores in the range 5-7, indicating a moderate probability of the same fold. Several of these are immunoglobulin chains (e.g., PDB code 1FVB heavy chain, which scores Z = 5.49 against the BBP profile), which are beta-barrels with overall similarity to the lipocalin fold though with different topology [34]; others (e.g., parvalbumin and concanavalin A) have substantially different folds, although concanavalin A is an all-beta protein of similar size to the lipocalins. This confirms that scores in this range cannot be taken as more than a guide to a possible folding pattern. Avidin [35], an 8-stranded beta-barrel with '+1' topology but with longer strands and shorter loops than is common in the lipocalins, was not extracted by any of the lipocalin 3-D profiles, although it had previously been suggested as related to the lipocalins by a similar scanning technique based only on sequence similarity and local secondary structure [36]. Streptavidin, an evolutionarily related protein with a very similar structure to avidin but with only 33% sequence identity, was not extracted by either method.
One sequence, a human mucin fragment (EMBL code A35690) which scores Z = 8.96 against the BBP structure, seems to be of particular interest. This is a highly glycosylated, 128-residue protein very rich in serine, threonine and proline and containing a number of tandem repeats [37]. Several regions of this sequence show similarity to one region of a chicken retinoic acid binding protein (a member of the FABP family). Secondary structure prediction gave a strong likelihood of a pattern of alternating beta strands and turns in this protein; whilst secondary structure prediction alone cannot be taken as predictive, the combination of the two methods strongly suggests a barrel structure. Other proteins for which these Eisenberg scans suggest a beta-barrel structure include endo-1,4-β-xylanase precursors, a thermopsin precursor and NADH-ubiquinone reductase. None of these proteins has any known similarity to the lipocalins, either in sequence or in function.
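The score bands used in this section can be written down as a small sketch. The function name, the band labels and the treatment of a score of exactly 5.0 as 'moderate' are assumptions made for illustration; the thresholds themselves (7.0 or over for a high probability of a similar fold, 5-7 for a moderate probability) and the example scores are those quoted in the text and in Table 3.

```python
# Illustrative sketch only: band labels and boundary handling at 5.0 are assumptions.
HIGH_Z = 7.0      # "Z-score of 7.0 or over": high probability of a similar fold
MODERATE_Z = 5.0  # lower edge of the 5-7 range: moderate probability


def classify_hit(z: float) -> str:
    """Assign a profile-search Z-score to one of the bands discussed in the text."""
    if z >= HIGH_Z:
        return "high"
    if z >= MODERATE_Z:
        return "moderate"
    return "none"


# Scores quoted in Section 3.2 and Table 3, keyed by (sequence, profile).
quoted_hits = {
    ("A35690", "BBP"): 8.96,            # human mucin fragment
    ("HS5DNAPOL", "MUP"): 7.65,         # DNA polymerase, human cytomegalovirus
    ("MUSIGKABL", "e-RABP"): 6.91,      # Ig kappa chain V region
    ("1FVB heavy chain", "BBP"): 5.49,  # immunoglobulin of known structure
}

for (name, profile), z in quoted_hits.items():
    print(f"{name:18s} vs {profile:7s} Z = {z:5.2f} -> {classify_hit(z)}")
```

As the text stresses, a 'moderate' label is no more than a guide to a possible folding pattern: immunoglobulin chains, parvalbumin and concanavalin A all fall in that band.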
Table 3
Non-lipocalin sequences giving Z-scores > 7.0 (indicating a high probability of having the same fold) with any of the lipocalin structures

Name          Z-score  Protein title
Bilin-binding protein
A35690        8.96     mucin (clone SIB124) - human (fragment)
MUSIGHFV      7.30     Ig H-chain V-JH1-region - Mus musculus
CD1_SYLFL     7.14     T-cell surface gp CD1
UL11_HCMVA    7.06     hypothetical protein UL11 - human cytomegalovirus
VGP8_EBV      7.01     probable membrane antigen GP85 - Epstein-Barr virus
Retinol-binding protein
XYNA_BACCI    8.07     endo-1,4-β-xylanase precursor - Bacillus circulans
XYNA_BACSU    8.07     endo-1,4-β-xylanase precursor - Bacillus subtilis
VE4_HPV1      7.38     probable E4 protein - human papillomavirus type 11
α-2u-Globulin
THPS_SULAC    8.62     thermopsin precursor (Sulfolobus acidocaldarius)
NU6M_ASCSU    8.45     NADH-ubiquinone oxidoreductase chain 6 (Ascaris suum)
PS0372        8.34     hypothetical protein (psaC region)
YNI3_METIL    7.12     hypothetical protein in NIFH1 5' region (fragment)
Insecticyanin
S06056        7.50     gene ND1 intron 3 protein 1 (fragment)
VME1_CVH22    7.13     E1 glycoprotein (matrix glycoprotein)
Mouse urinary protein
HS5DNAPOL     7.65     DNA polymerase - human cytomegalovirus

The highest score of a non-lipocalin sequence against e-RABP was 6.91 for an Ig kappa chain V region (OWL code MUSIGKABL).

4. Conclusions
The results reported here suggest that, where structural data are only available for a few members of a large protein family such as the lipocalins, 3-D profile analysis of each structure in turn may be a powerful way of determining both subtle structural relationships between members of the family and unrelated proteins which may share a similar fold. This could be a useful technique for determining equivalent structures for homology modelling, or for generating starting models for solving crystal structures by molecular replacement. Here the complementarity between the β-lactoglobulin sequences and the structures of A2U and MUP, for example, was much more marked than that revealed by sequence similarity alone. The extraction of sequences such as the α-1-microglobulin/bikunin precursor, using a profile derived from a mature lipocalin structure, implies that the stability of this lipocalin as a mature protein is not affected by the fact that it is synthesised as part of a much larger precursor. It will be interesting to see whether, as predicted here, the α-1-microglobulin sub-family shares the extra C-terminal helix of e-RABP, and whether there are any significant structural differences between the α-1-acid glycoproteins and the rest of this protein family. From an analysis of their mean scores against all six profiles, the 15 non-lipocalin sequences giving Z-scores over 7.0 against any lipocalin profile seem to be as compatible with the lipocalin fold as a group of 'non-standard' lipocalin sequences.
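The 'mean score against all six profiles' used in this comparison can be sketched as follows. The function names, the sequence labels and every number in the example are hypothetical placeholders, not values from this study, and the use of the sample standard deviation is an assumption, since the text does not say which form of s.d. was reported.

```python
# Illustrative sketch only: all scores below are made-up placeholders.
from statistics import mean, stdev

PROFILES = ("RBP", "BBP", "insecticyanin", "A2U", "MUP", "e-RABP")


def mean_profile_score(z_by_profile: dict) -> float:
    """Mean Z-score of one sequence over all six lipocalin profiles."""
    return mean(z_by_profile[p] for p in PROFILES)


def summarise_group(group: dict) -> tuple:
    """Mean and sample s.d. of the per-sequence mean scores for a group."""
    per_sequence = [mean_profile_score(z) for z in group.values()]
    return mean(per_sequence), stdev(per_sequence)


# Hypothetical sequences, each scoring highly against only one profile.
example_group = {
    "seq_A": {"RBP": 1.0, "BBP": 8.0, "insecticyanin": 2.0,
              "A2U": 3.0, "MUP": 2.0, "e-RABP": 2.0},
    "seq_B": {"RBP": 7.0, "BBP": 1.0, "insecticyanin": 0.0,
              "A2U": 2.0, "MUP": 1.0, "e-RABP": 1.0},
    "seq_C": {"RBP": 1.0, "BBP": 2.0, "insecticyanin": 1.0,
              "A2U": 7.5, "MUP": 2.0, "e-RABP": 1.5},
}

group_mean, group_sd = summarise_group(example_group)
print(f"group mean = {group_mean:.2f}, s.d. = {group_sd:.2f}")
```

Averaging over all six profiles in this way damps the effect of a single high score, which is why a sequence can exceed 7.0 against one profile and still have a modest mean.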
Acknowledgements
We wish to thank Dr. M. Newcomer for the e-RABP coordinates, and Drs. H. Holden and I. Rayment for the insecticyanin coordinates, before public release by the PDB. We also wish to thank Mr. D.N. Perkins for the use of his unpublished program, FEAT, Dr. J.E. Lydon for help with figure preparation and Dr. J.A. Halliday for useful discussions. This work benefited from the use of the SEQNET computer facilities and C.E.S. is supported by a grant to the Leeds Molecular Recognition Centre for Biological Systems from the Science and Engineering Research Council.
References
[1] North, A.C.T. (1991) Biochem. Soc. Symp. 57, 35-48.
[2] Bocskei, Z., Groom, C.R., Flower, D.R., Wright, C.E., Phillips, S.E.V., Cavaggioni, A., Findlay, J.B.C. and North, A.C.T. (1992) Nature 360, 186-188.
[3] Flower, D.R., North, A.C.T. and Attwood, T.K. (1991) Biochem. Biophys. Res. Commun. 180, 69-74.
[4] Flower, D.R., North, A.C.T. and Attwood, T.K. (1993) Protein Science 2, 753-761.
[5] Sander, C. and Schneider, R. (1991) Proteins 9, 56-68.
[6] Glatz, J.F.C. and Van der Vusse, G.J. (1990) Mol. Cell Biochem. 98, 237-251.
[7] Zanotti, G., Scapin, S., Spadon, P., Veerkamp, J.H. and Sacchettini, J.C. (1992) J. Biol. Chem. 267, 18541-18550.
[8] Newcomer, M.E. (1993) Structure 1, 7-18.
[9] Bowie, J.U., Lüthy, R. and Eisenberg, D. (1991) Science 253, 164-170.
[10] Eisenberg, D., Bowie, J.U., Lüthy, R. and Choe, S. (1992) Faraday Discuss. Chem. Soc. 93, 25-34.
[11] Lüthy, R., Bowie, J.U. and Eisenberg, D. (1992) Nature 356, 83-85.
[12] Wilmanns, M. and Eisenberg, D. (1993) Proc. Natl. Acad. Sci. USA 90, 1379-1383.
[13] Bernstein, F.C., Koetzle, T.F., Williams, G.J.B., Meyer, E.F., Brice, M.D., Rodgers, J.R., Kennard, O., Shimanouchi, T. and Tasumi, M. (1977) J. Mol. Biol. 112, 535-542.
[14] Cowan, S.W., Newcomer, M.E. and Jones, T.A. (1990) Proteins 8, 44-61.
[15] Huber, R., Schneider, M., Mayr, I., Mueller, R., Deutzmann, R., Suter, F., Zuber, H., Falk, H. and Kayser, H. (1987) J. Mol. Biol. 198, 499-513.
[16] Holden, H.M., Rypniewski, W.R., Law, J.H. and Rayment, I. (1987) EMBO J. 6, 1565-1570.
[17] Bleasby, A.J. and Wootton, J.C. (1990) Protein Eng. 3, 153-159.
[18] Jones, T.A., Zou, J.Y., Cowan, S.W. and Kjeldgaard, M. (1991) Acta Crystallogr. A47, 110-119.
[19] Kabsch, W. and Sander, C. (1983) Biopolymers 22, 2577-2637.
[20] Bairoch, A. (1993) Nucleic Acids Res. 21, 3097-3103.
[21] Akrigg, D.A., Attwood, T.K., Bleasby, A.J., Findlay, J.B.C., North, A.C.T., Parry-Smith, D.J., Perkins, D.N. and Wootton, J.C. (1992) CABIOS 3, 295-296.
[22] Devereux, J., Haeberli, P. and Smithies, O. (1984) Nucleic Acids Res. 12, 387-395.
[23] Garnier, J. and Robson, B. (1989) in Prediction of Protein Structure and the Principles of Protein Conformation (Fasman, G.D., ed.), pp. 417-465, Plenum Press, New York.
[24] Biosym Technologies Inc. (1991) INSIGHT II molecular modelling software, San Diego, CA, USA.
[25] Sacchettini, J.C., Gordon, J.I. and Banaszak, L.J. (1989) J. Mol. Biol. 208, 327-339.
[26] Fukuyama, K., Matsubara, H. and Rogers, L.J. (1992) J. Mol. Biol. 225, 775-789.
[27] Igarashi, M., Nagata, A., Toh, H., Urade, Y. and Hayaishi, O. (1992) Proc. Natl. Acad. Sci. USA 89, 5376-5380.
[28] Morel, L., Dufaure, J.P. and Depeiges, A. (1993) J. Biol. Chem. 268, 10274-10281.
[29] Kaumeyer, J.F., Polazzi, J.O. and Kotick, M.P. (1986) Nucleic Acids Res., 7839-7850.
[30] Hambling, S.G., McAlpine, A.S. and Sawyer, L. (1992) in Advanced Dairy Chemistry (Fox, P.F., ed.), Vol. 1, Chap. 4, pp. 140-190, Elsevier, London.
[31] Halliday, J.A., Bell, K., McAndrew, K. and Shaw, D.C. (1993) Protein Seq. Data Anal. 5, 201-205.
[32] Godovac-Zimmermann, J. and Shaw, D.C. (1987) Biol. Chem. Hoppe-Seyler 368, 879-886.
[33] Cooper, R., Eckley, D.M. and Papaconstantinou, J.J. (1987) Biochemistry 26, 5244-5270.
[34] Padlan, E.A. and Kabat, E.A. (1988) Proc. Natl. Acad. Sci. USA 85, 6885-6889.
[35] Livnah, O., Bayer, E.A., Wilchek, M. and Sussman, J.L. (1993) Proc. Natl. Acad. Sci. USA 90, 5076-5080.
[36] Lüthy, R., McLachlan, A.D. and Eisenberg, D. (1991) Proteins 10, 229-239.
[37] Gum, J.R., Hicks, J.W., Swallow, D.M., Lagace, R.L., Byrd, J.C., Lamport, D.T.A., Siddiki, B. and Kim, Y.S. (1990) Biochem. Biophys. Res. Commun. 171, 407-415.