Quantifying Structure–Function Uncertainty: A Graph Theoretical Exploration into the Origins and Limitations of Protein Annotation

doi:10.1016/j.jmb.2004.02.009

J. Mol. Biol. (2004) 337, 933–949

Boris E. Shakhnovich* and J. Max Harvey

Bioinformatics Program, Boston University, Boston MA 02215, USA

Since the advent of investigations into structural genomics, research has focused on correctly identifying domain boundaries, as well as domain similarities and differences in the context of their evolutionary relationships. As the science of structural genomics ramps up adding more and more information into the databanks, questions about the accuracy and completeness of our classification and annotation systems appear on the forefront of this research. A central question of paramount importance is how structural similarity relates to functional similarity. Here, we begin to rigorously and quantitatively answer these questions by first exploring the consensus between the most common protein domain structure annotation databases CATH, SCOP and FSSP. Each of these databases explores the evolutionary relationships between protein domains using a combination of automatic and manual, structural and functional, continuous and discrete similarity measures. In order to examine the issue of consensus thoroughly, we build a generalized graph out of each of these databases and hierarchically cluster these graphs at interval thresholds. We then employ a distance measure to find regions of greatest overlap. Using this procedure we were able not only to enumerate the level of consensus between the different annotation systems, but also to define the graph-theoretical origins behind the annotation schema of class, family and superfamily by observing that the same thresholds that define the best consensus regions between FSSP, SCOP and CATH correspond to distinct, non-random phase-transitions in the structure comparison graph itself. To investigate the correspondence in divergence between structure and function further, we introduce a measure of functional entropy that calculates divergence in function space. First, we use this measure to calculate the general correlation between structural homology and functional proximity. 
We extend this analysis by quantifying the average amount of functional information gained from our understanding of structural distance, and the corresponding inherent uncertainty that represents the theoretical limit of our ability to infer function from structural similarity. Finally, we show how our measure of functional “entropy” translates into a more intuitive concept of functional annotation into similarity EC classes. © 2004 Published by Elsevier Ltd.

*Corresponding author

Keywords: protein domain; annotation; graph theory; database; structure–function; evolution

Supplementary data associated with this article can be found at doi: 10.1016/j.jmb.2004.02.009

Abbreviations used: GO, gene ontology; FSS, functional flexibility score.

E-mail address of the corresponding author: [email protected]

0022-2836/$ - see front matter © 2004 Published by Elsevier Ltd.

Introduction

The problem of protein domain evolution is not a recent one, dating at least to the 1970s and 1980s, when researchers noted that both protein and DNA sequences have large conserved or nearly conserved parts.1,2 These conserved parts were called motifs.3,4 Later, when three-dimensional data became available in large amounts, researchers noted that not only the sequences but also the structures of proteins were conserved.5–8 This led to the idea that these conserved parts were atomic units of evolution, i.e. evolutionary pressures were brought to bear not upon whole proteins but upon these conserved parts, dubbed “domains”, that could later recombine to form proteins.9–12 Equivalent to solving the problem of protein domain evolution is the problem of annotation of protein domains.13–15 Both problems hinge in part on finding the relative strengths of pressures that yield the most accurate evolutionary distance measure between protein domains.16,17 Since evolutionary pressure is exerted on both structure and function, the distance measure most representative of the co-evolution of structure and function would then produce a reliable system of domain annotation by homology modeling. In particular, researchers have to know how quickly and in what order domain evolution progresses in order to reliably infer the function of novel domains from structure or sequence homology data. However, due to the complex nature of the structure–function relationship and the non-linear progression of protein domain evolution,17–19 distances are currently calculated qualitatively on the basis of many shared characteristics such as sequence, structure and function.
This annotation schema yields a hierarchical organization of data where domains sharing sequence are closest, followed by domains sharing function and then structure in descending degree of proximity.20,21 Even when combining comparisons of many characteristics, problems still remain that obscure our ability to fully understand protein domain evolution by homology modeling.22,23 For example, problems such as lack of understanding for the relative prevalence of divergence versus convergence24 – 26 in domain evolution prevent quantitative modeling of the rate of divergence of structure and function. However, a major insight from previous research into protein domain annotation was the result that pointed to the existence of a hierarchical organization that best describes, in general, the relationships between different domains,27,28 i.e. there are certain characteristics that define clear delineation of classes from folds, superfamilies and families of domains.29,30 These were easily discernable and largely conserved macro characteristics such as secondary structure composition or conservation of function and sequence homology. Most groups that do research in this area agree on this general representation of the domain universe.31,32 However, due to the problems mentioned above, automatic assignment of new domains into this annotation system proved to be a daunting task that was accomplished only recently with some success using neural network analysis and statistical clustering algorithms.33,34 With all the successes in protein domain assignment and comparison, there have been relatively

few breakthroughs in the area of quantifying the structure –function relationship. Recently we examined the co-evolution of structure and function by noting that adding structurally homologous domains to clusters in protein domain universe graph (PDUG) added to the significance of a functional fingerprint of that group of domains.18 The problems of structure comparison and functional annotation from that comparison are related since the origin of a superfamily (a set of domains that share a loose functional similarity) starts from a common ancestor and diffuses in structure– function space.26 Therefore, the definition of a functional neighborhood is intimately related to the proper construction and delineation of the structural one.18 Here, we address both the question of proper delineation of a structural neighborhood and the question of how much function changes with respect to a change in structure. Since “function” is not an easily quantifiable characteristic, we introduce a novel measure of functional divergence called “functional entropy”. This value roughly measures the number of different functions performed by an arbitrary set of sequences or structures. Functional entropy enables us to quantifiably correlate the divergence of structure and function. Understanding the structure– function relationship is a key question that is at the root of computational biology and bioinformatic research. 
Methods developed here can be applied to increase our understanding of whole genome annotation following sequencing.13,14,35,36 Ability to annotate new genes in recently sequenced genomes hinges in part on our ability to understand the relative rate of divergence of sequence, structure and function, since even a small amount of divergence from the closest homologue raises a possibility of functional change.14,36 With sequence alignment nearing the limit of resolution and as fewer new genes are amenable to functional annotation from sequence alignment, new functional inference will come from our increased understanding of structure– function relationship.37 In particular, researchers can compute the amount of functional information that can be gained from sequence or structure homology modeling and the amount that is intrinsically uncertain due to divergence. More intuitively, one can think of this process as the ability to place structures from a newly sequenced genome into functional neighborhoods. The precision of the answer can be evaluated with the methods presented here and is trivially dependent on the amount of divergence from the closest known homologous sequence or structure. Thus, we attempt to address some of the problems posed above. What is the level of consensus between manual curation of protein domain evolution and automatic assignment? What do the similarities between these methods tell us about protein domains and their evolution? Where do the differences come from? Can we explain the theoretical origin and limitation of our ability to automatically assign annotations using neural

networks or statistical methods? What are the limitations of functional annotation from structural comparison? What is the underlying principle that may explain the commonalities between structure and function relationships?

Protein Domain Databases as Clustered Graphs

The first question we tackle is a rigorous evaluation of consensus between annotations of structures by the most commonly used databases. In order to evaluate the consensus in protein domain assignment, we consider a few of the most widely used annotation systems and databases. There are a multitude of protein domain databases; however, we focus on three: SCOP,38 CATH27 and FSSP.39,40 This choice is made not to single out some databases versus others, but because they represent a range from full manual curation combining phylogeny, sequence, structure and function data to fully automatic assignment based only on computerized structure comparison. There also exists a dichotomy of annotations, with the discrete SCOP and CATH assignment of class, architecture, fold, superfamily and family on one side and the continuous structure comparison methods of FSSP on the other. Comparing the results from differing methods in this range will allow us to gain important insights into the origins of the relationship between manual curation and automatic assignment and between discrete and continuous measures, as well as glean the first insights into the correlation between structure homology and function of protein domains.

SCOP38 is one of the first databases created that describes protein domains and their evolutionary relationships. The classification is based on a hierarchical system where all domains are initially split into 11 categories based on their predominant secondary structure elements, such as all alpha or all beta. This coarsest level of classification is called the Class. The next level of categorization is based mostly on conserved secondary structure motifs, where all domains sharing a particular motif are classified in the same Fold. Further categorization is mostly based on function and long-range sequence conservation and is dubbed Superfamily and Family.
To build a graph out of SCOP we take all pairs of domains and calculate their proximity according to how many levels in the SCOP hierarchy the two domains share. For example, if the domains fall into the same Superfamily but are annotated as belonging to different Families the proximity is equal to 3, since they share the first three levels of SCOP annotation (Class, Fold and Superfamily). Therefore, in this graph domains with larger proximity are closer together. For CATH we perform a very similar procedure as for SCOP. CATH is also organized into four hierarchically defined levels. The first is Class that is based on the predominant makeup of secondary structure. Architecture and Topology describe

more precisely the secondary structure motifs. The Homologous Superfamily level is meant to represent common evolutionary ancestry and almost always conserves the general description of function.41 We go on to build a graph from CATH, analogous to the one built from SCOP. We calculate the proximity for each pair of domains based on the number of common classifications they share in the CATH hierarchy. Naively one may think that the levels of CATH would correspond equivalently to levels in SCOP. The first hint that this may not be the case is that CATH has only four Classes while SCOP has 11. Unlike the fully manual curation in SCOP, CATH employs manual curation as well as automatic assignment to evaluate the proper placement of new domains into the hierarchy of the database. Finally, FSSP39 is a database that utilizes the DALI42 structure comparison engine to fully automate the similarity comparison between protein domains. DALI uses contact maps that describe the Euclidean distance between all pairs of amino acid residues to compare the structures of the protein domains. Obviously, this does not yield a discrete, hierarchical description in the same sense as SCOP38 and CATH.27 A “continuous” distance measure between any two domains is defined based on how closely related their contact maps are. This distance is called the Z score and measures the normalized probability that two contact maps have a certain number of common contacts with respect to a distribution of comparisons between random contact matrices. It is important to note that since FSSP is fully automated and is based purely on mathematical formulation of distance between structures (or their contact matrix equivalents) no information about functionality or sequence homology to any other domain is included in this “annotation”. It is also important to note that FSSP does not have a “hierarchical” annotation scheme like SCOP and CATH but forms a comparison graph. 
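The proximity used for the SCOP and CATH graphs (the number of leading hierarchy levels two domains share) can be illustrated with a minimal sketch. The function names, the tuple representation of an annotation and the toy labels are assumptions made for illustration; they are not taken from the paper or from the SCOP and CATH data formats.

```python
from itertools import combinations


def proximity(levels_a, levels_b):
    """Count how many leading hierarchy levels two domains share."""
    shared = 0
    for a, b in zip(levels_a, levels_b):
        if a != b:
            break
        shared += 1
    return shared


def hierarchy_graph(annotations):
    """Return {(domain_i, domain_j): proximity} for every pair of domains."""
    edges = {}
    for d1, d2 in combinations(sorted(annotations), 2):
        edges[(d1, d2)] = proximity(annotations[d1], annotations[d2])
    return edges


# Hypothetical SCOP-style labels: (Class, Fold, Superfamily, Family).
toy = {
    "domA": ("a", "a.1", "a.1.1", "a.1.1.1"),
    "domB": ("a", "a.1", "a.1.1", "a.1.1.2"),
    "domC": ("b", "b.1", "b.1.1", "b.1.1.1"),
}
edges = hierarchy_graph(toy)
```

In this toy example domA and domB share Class, Fold and Superfamily but not Family, so their proximity is 3, matching the worked example in the text.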
Noting this, we build a graph out of FSSP analogous to the graphs built from SCOP and CATH, where the nodes are domains and the edges are weighted with the Z score between the structures of those domains. This graph is analogous to the PDUG described earlier.43 We will use this graph to find the best Z score measure of structural similarity corresponding to the discrete assignments of Fold, Class and Superfamily employed by SCOP and CATH. Through the above graph-morphing procedures for SCOP, CATH and FSSP we end up with three weighted graphs, one for each database. The nodes in each graph are the protein domains and the edges are the relationships defined by distances or proximity from each database. We proceed to cluster these graphs at regular interval cutoffs. For example, for FSSP we build a graph at each threshold from Z = 2 to 16 with step 0.5. In order to do this, we pick a cutoff and keep all edges whose weights are larger than this cutoff.44 A cluster is then a set of domains connected by edges with weights larger

Figure 1. Schematic representing the building and clustering of a protein domain graph. The smaller graph on the right represents all domains connected by edges specified by the FSSP database. The blown up portion in the red circle shows domains with weighted edges. Clustering the graph amounts to searching for sets of domains where every domain has a path to every other domain in the same set. For example, if the clustering were done at cutoff 10 then the domains in green would represent one cluster while the domains in blue another. The edges depicted with ripples do not survive the cutoff procedure. Notably, while the edge with weight 5 does not survive the cutoff procedure, all domains in green form a cluster, since there still exists a path from each of the domains in green to every other green domain through edges with weights larger than the cutoff.

than cutoff (Figure 1). The only requirement for a cluster is the existence of a path from every domain in the cluster to every other domain in the same cluster. For example, for SCOP and CATH graphs at cutoff 3 we group all domains that share the same SCOP Superfamily and CATH Topology. We then proceed to turn each clustering into a square matrix where element i,j is 1 if domain i and domain j are in the same cluster and 0 otherwise (equation (1)). We build one matrix for each graph at each cutoff that we consider. Our goal is then to compare these matrices and quantify the degree of correlation between them:

$$
N^{T} = \begin{pmatrix} a_{11} & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & a_{nn} \end{pmatrix},
\qquad
a_{ij} = \begin{cases} 1 & \text{if domains } i,j \text{ belong to the same cluster} \\ 0 & \text{otherwise} \end{cases}
\tag{1}
$$

Here N is the matrix built from clustering the graph at threshold T.

Comparison of Protein Domain Annotation Graphs

Since we built and clustered the protein domain databases in the form of graphs, we start tackling the problem of comparing the three databases as a particular instance of finding the distance between two undirected graphs. This problem is analogous to one of comparing two binary strings (or in our case binary matrices) where each digit represents the existence of the pair of domains in the same cluster. Let us remind ourselves that if the two domains i and j exist in the same cluster the value of the i,j element in the matrix is 1, otherwise it is 0 (equation (1)). We can use this notation to define a few important quantities that will enable us to compare two graphs. A true positive (TP) is when both graphs place the two domains i and j into the same cluster, i.e. the value Mij in both matrices is 1. A false positive (FP) or false negative (FN) is when one graph puts the two domains into the same cluster, while the other does not. A true negative (TN) is when both graphs have the two domains in different clusters. This is summarized in Table 1 and can be calculated using equation (2) for each pair of matrices N and M clustered at thresholds T1 and T2, respectively.

Table 1. A sample truth table that is built for every domain pair in graph 1 and graph 2

Name    Graph 1    Graph 2
TP      1          1
FP      1          0
FN      0          1
TN      0          0

1 symbolizes that the domains are in the same cluster while 0 indicates that they are in different clusters. For every domain pair in common between two graphs we calculate whether the annotation is a TP, FP, FN or TN.

In order to calculate the distance between two graphs we have to compare the number of TP versus FP and FN that the comparison between the two graphs yields:

$$
\begin{aligned}
TP^{T_1,T_2}_{N,M} &= \sum_{i,j \in \{1..n\}} N^{T_1}_{ij} \wedge M^{T_2}_{ij} \\
FN^{T_1,T_2}_{N,M} + FP^{T_1,T_2}_{N,M} &= \sum_{i,j \in \{1..n\}} \left( N^{T_1}_{ij} \vee M^{T_2}_{ij} - N^{T_1}_{ij} \wedge M^{T_2}_{ij} \right) \\
TN^{T_1,T_2}_{N,M} &= |N|^{2} - \sum_{i,j \in \{1..n\}} N^{T_1}_{ij} \vee M^{T_2}_{ij}
\end{aligned}
\tag{2}
$$

Here N and M represent the two matrices describing the clustering of two different databases (equation (1)), and T1 and T2 are the two thresholds at which the graphs were clustered. After the quantities shown above have been defined, the distance measure between two graphs is merely a calculation of how many TPs the two graphs share with respect to FNs and FPs. This measure is meant to calculate the level of agreement between the two graphs with respect to how many domain pairs they classify in the same cluster. Of course this is only the first approximation of distance. A true measure of distance between two clusterings would also search for all combinations of similarly annotated triples, quadruples, …, n-tuples of domains. Since that is too computationally expensive to be viable, we satisfy ourselves with only the first-order approximation. We pick the simplest measure that compares the TP to FP and FN, e.g. Jaccard, defined as:

$$
J = \frac{TP}{FP + FN + TP}
\tag{3}
$$

While there are many measures that can perform a similar task of measuring distance between two clusterings, we have checked that the choice of the particular measure does not affect the conclusions here (data not shown). It is, however, important to note that in contrast to measures used in other comparable studies, this measure is symmetric, i.e. it does not depend on the direction of comparison. On the other hand, since it penalizes both FPs and FNs, its value is lower than those generated by one-sided comparison measures such as sensitivity and specificity.31,32

Distance Landscapes Between Graphs

Using the Jaccard distance measure defined above, we can now calculate the distance landscapes for any two databases with respect to every pair of cutoffs. We calculate the Jaccard distance using equation (3) for every N, M pair of databases at every T1, T2 pair of thresholds. Thus, the Jaccard distance at each pair (T1, T2) of cutoffs is the ratio of domain pairs annotated in the same cluster versus different clusters between the two graphs when each is clustered at its respective threshold. For example, at cutoff pair {2,3} on the SCOP–CATH distance landscape the domain pairs that are annotated with the same SCOP Fold are compared to the annotations of the same domain pairs into the same CATH Topology (Figure 2). The maxima of these landscapes then indicate the threshold pairs (T1, T2) where the greatest overlap between the two annotation systems occurs. This, in turn, means that two databases share the greatest consensus when clustered at those thresholds. A SCOP–CATH landscape is shown in Figure 2. By fixing the cutoff in one graph and looking for the maximum of that slice we find the best matching cutoff in the other graph by solving equation (4):

$$
\left. \frac{\partial J_{N^{T_1},M}}{\partial T_2} \right|_{T_1 = T} = 0
\tag{4}
$$

Here, $J_{N^{T_1},M}$ is the Jaccard distance between graph N, with threshold T1 held constant at T, and graph M. T2 is the second, variable threshold. For example, when comparing SCOP and CATH and holding the SCOP threshold at T1 = 2, the maximum occurs at CATH cutoff 3. Thus, we can deduce from Figure 2 that the Fold annotation in SCOP (cutoff 2) best overlaps with the Topology level in CATH (cutoff 3), not the Architecture level (cutoff 2). We can also deduce that the Architecture level (cutoff 2) for CATH corresponds to an “intermediate” annotation between SCOP Class and Fold. These results are not very surprising and are shown here mostly to validate the method. However, it is interesting that the greatest overlap as measured by Jaccard between SCOP and CATH lies at the Family level of SCOP and Superfamily level of CATH. This could indicate that the evolutionary boundaries are best defined on that level of description, or perhaps that this annotation level, in part based on sequence alignment for both databases, uses the same methodology for both SCOP and CATH to bring about the best overlap in results. It may also mean that the Superfamily level, which is defined in part based on functional conservation, exhibits less convergence than the levels above.

Figure 2. Jaccard distance landscape for SCOP–CATH comparison. The SCOP and CATH axes represent the cutoffs (T1, T2) where the graphs were clustered. The larger the Jaccard distance the closer the two graphs are. The analysis was done on 3306 domains that are shared by SCOP and CATH, disregarding >25% sequence homologues within databases. Thus, the nodes were defined as those domains that shared primary sequences determined by a two-way stringent BLAST between SCOP and CATH and then the edges and weights were defined as described above.

We go on to utilize the same methodology to compare the fully automatic structure comparison measure of Z scores from FSSP to CATH and SCOP classifications (Figures 3 and 4). First, by using the continuous distance measure of structural comparison in FSSP, we can find non-trivial correlations of the continuous structural alignment to the discrete annotation of SCOP and CATH. Since CATH cutoff 4 and SCOP cutoffs 3 and 4 describe largely conserved protein domain function, the maximum of FSSP with respect to those cutoffs represents the level of structure conservation that also yields general functional conservation. In this way we can propose a preliminary definition of the structure–function relationship of protein domains. We will explore this relationship in more detail below.
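The clustering, pair comparison and landscape slice described in the preceding sections can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: all function names and the toy data are hypothetical, each unordered pair is counted once and the diagonal is omitted, the cutoff comparison is taken as inclusive so that cutoff 3 groups domains with proximity 3, and the maximum over T2 is found by scanning a discrete list of cutoffs rather than by solving equation (4) analytically.

```python
from itertools import combinations


def cluster(nodes, edges, cutoff):
    """Label connected components using only edges with weight >= cutoff."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for (u, v), weight in edges.items():
        if weight >= cutoff:  # assumption: inclusive comparison
            parent[find(u)] = find(v)
    return {n: find(n) for n in nodes}


def co_clustered_pairs(labels):
    """Unordered domain pairs placed in the same cluster (the 1-entries of equation (1))."""
    return {
        (u, v)
        for u, v in combinations(sorted(labels), 2)
        if labels[u] == labels[v]
    }


def jaccard(pairs_n, pairs_m):
    """TP / (TP + FP + FN), as in equation (3)."""
    tp = len(pairs_n & pairs_m)     # both graphs co-cluster the pair
    fp_fn = len(pairs_n ^ pairs_m)  # exactly one graph co-clusters the pair
    total = tp + fp_fn
    return tp / total if total else 0.0


def best_matching_cutoff(nodes, edges_n, t1, edges_m, cutoffs_m):
    """Fix T1 in graph N and scan T2 in graph M for the Jaccard maximum."""
    pairs_n = co_clustered_pairs(cluster(nodes, edges_n, t1))
    scores = {
        t2: jaccard(pairs_n, co_clustered_pairs(cluster(nodes, edges_m, t2)))
        for t2 in cutoffs_m
    }
    return max(scores, key=scores.get), scores


# Hypothetical toy data: hierarchy proximities (graph N) and Z scores (graph M).
nodes = ["a", "b", "c", "d"]
edges_n = {("a", "b"): 4, ("b", "c"): 3, ("c", "d"): 1}
edges_m = {("a", "b"): 12.0, ("b", "c"): 7.0, ("c", "d"): 3.0}
best, scores = best_matching_cutoff(nodes, edges_n, 3, edges_m, [2.0, 6.0, 10.0])
```

On this toy input the slice at T1 = 3 peaks at the middle cutoff, where both graphs produce the same clusters. Storing only the co-clustered pairs avoids materializing the full n × n matrices of equation (1), which matters for graphs with thousands of domains.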

Figure 3. The distance landscape for the SCOP–FSSP comparison. There are three well-defined cusps and maxima that can also be seen in two dimensions in Figure 5. These points occur at (Z = 6, SCOP = 2), (Z = 9, SCOP = 3) and (Z = 11, SCOP = 4). There are 2800 domains in the dataset representing the overlap in annotated domains between SCOP and FSSP.

Figure 4. The distance landscape for the FSSP–CATH graph comparison. There are only two cusps or maxima in this graph: (Z = 6, CATH = 3) and (Z = 9, CATH = 4). This dataset contained 2400 domains that represent the domains commonly annotated by both FSSP and CATH.

The above distance landscapes (Figures 3 and 4) tell a very interesting story. First, we can see that the landscapes follow a transitivity property. For example, SCOP level 3 overlaps most with CATH level 4 (Figure 2), and the maximum for SCOP level 3 is at FSSP Z = 9 (Figure 3), which is also the maximum for CATH level 4 (Figure 4). This is encouraging because it shows consistency of results. However, the really interesting observation is that there are well-defined, non-overlapping cusps or maxima representing clustering of FSSP at those thresholds that correspond best to assignments of Fold, Superfamily and Family in CATH and SCOP. These cusps or maxima for Fold, Superfamily and Family occur at around FSSP Z = 6, 9 and 11 (Figure 5). This is surprising since FSSP is a continuous, fully automatic structural comparison between protein domain structures, i.e. there is no a priori reason to expect certain distances to be uniformly favored over others in the structure comparison space. Thus, we were able to define the best structural similarity threshold as defined by FSSP for each hierarchical level of manual annotation for both SCOP and CATH. This finding suggests that there are clear structural comparison thresholds that both CATH and SCOP annotations agree are indicative of a change in hierarchical level of description. In other words, the discrete annotations “defined” by SCOP and CATH are not arbitrary, but correspond to well-defined clustering thresholds of structures as measured by automatic comparison using Z scores.42 At this point we can make our first observations about the structure–function relationship. Since conservation of general functionality occurs at the Homologous Superfamily level and beyond in

Figure 5. Four slices of the 3D graph depicted in Figure 3. The cusps and maxima are easily discernible from these slices by solving equation (4). At SCOP cutoff 1 the Jaccard distance is actually smaller than the random control, indicating that this level of annotation is probably not indicative of real evolutionary homology and may not indicate meaningful annotation. At SCOP cutoffs 2, 3 and 4 the Jaccard distance between FSSP and SCOP is many thousands of standard deviations away from random. At SCOP cutoff 2 (Fold level) the cusp occurs at Z = 6, at SCOP cutoff 3 (Superfamily level) the maximum occurs at Z = 9 and at SCOP cutoff 4 (Family level) the maximum occurs at Z = 10–11.

the CATH annotation hierarchy, we can note that the general functional conservation occurs at structural similarity of Z = 9 and beyond. While this is not a quantitative argument, it serves as an example of how comparison of automatic and manual assignment can be used to glean new information not originally present in either database explicitly. We will explore the issue of the structure–function relationship in more detail in later sections. Finally, it is worth noting the underlying reasons for cusp formation at SCOP threshold 2 while more pronounced maxima exist at overlaps with later levels in the SCOP hierarchy. As mentioned earlier, the Jaccard measure of distance is only a first approximation of proximity between the two clustered graphs. One peculiarity of this measure is that an improper annotation of a single domain has a different influence on the overall distance depending on whether that annotation was made improperly into or out of a large cluster or a small cluster. For example, if the improper annotation placed a domain into a cluster originally composed of 220 domains, the number of FP from this misannotation would be 220; however, if the misannotation only added a single domain to a cluster originally containing an orphan, the number of FP would rise by only 1. In both cases the mistake is “singular”. Since Z scores of 6 and below retain most of the large clusters of structurally similar domains intact, only adding single domains to small clusters, the Jaccard graph shows a cusp due to the relatively small contribution or underrepresentation of FP pairs. The argument presented above introduces a certain amount of ambiguity into comparison of different pairs of graphs to each other. As can be easily seen from Figure 5, the cusps and maxima are not extremely pronounced. While the placement of exact overlap values is probably tricky, we are confident of the approximate value. Also, since Z scores continuously change the graph by splitting existing clusters into smaller subsets, we would expect a smooth transition along the Z clustering axes, as observed. However, it is also worth noting that the standard deviation for graphs as large as PDUG is of the order of $SD(J) = 1 \times 10^{-6}$; thus a small change in the Jaccard value for different thresholds along Z can be shown to be highly non-random (results not shown).
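The dependence of the Jaccard value on the size of the cluster that receives a misplaced domain can be checked with a small worked example. The sketch below assumes that the misplaced domain is an orphan in the reference clustering, so the mistake creates false positives only and no false negatives; the function names and the cluster sizes other than 220 are illustrative.

```python
def pairs_in_cluster(size):
    """Number of unordered domain pairs inside a cluster of the given size."""
    return size * (size - 1) // 2


def false_positives_from_one_misplacement(cluster_size):
    """Extra co-clustered pairs created by adding one wrong domain to a cluster."""
    return pairs_in_cluster(cluster_size + 1) - pairs_in_cluster(cluster_size)


def jaccard_after_misplacement(reference_sizes, target_index):
    """Jaccard value when one orphan is wrongly added to the chosen reference cluster."""
    tp = sum(pairs_in_cluster(s) for s in reference_sizes)
    fp = false_positives_from_one_misplacement(reference_sizes[target_index])
    return tp / (tp + fp)


# Hypothetical reference clustering: one cluster of 220 domains and one orphan.
sizes = [220, 1]
j_large = jaccard_after_misplacement(sizes, 0)  # orphan added to the large cluster
j_small = jaccard_after_misplacement(sizes, 1)  # orphan added to the other orphan
```

The same single mistake adds 220 false-positive pairs in the first case and 1 in the second, so the first lowers the Jaccard value more.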

Origins of Hierarchy: Phase Transitions in the Graph

In order to figure out why the graphs created using distance from the automatic, continuous structural comparison measure Z, clustered at particular thresholds, correspond to separation of structure into well-defined hierarchical classes, we have to examine the properties of the FSSP graph itself.43 First, we investigate the size of the largest cluster with respect to the cutoff distance. Then we compare this to the size of the largest cluster that we would expect at random. To calculate the expected size of the largest cluster for the random graph, we build 200 random graphs at each cutoff. To build a random graph we take the number of nodes and the number of edges at each cutoff and redistribute the edges randomly by picking two random nodes until no edges are left, as described in detail elsewhere.45–47 We then cluster this random graph as before44 (Figure 1) and record the size of the largest cluster. Figure 6 plots the mean of these measurements with error bars as one standard deviation. The behavior of the size of the largest cluster (Figure 6) and its difference from random bears a striking resemblance to the maxima we just observed on the distance landscapes between the three databases (Figures 3 and 4). We can see that there are two very pronounced phase transitions in the size of the largest cluster. The first is from FSSP Z = 6 to Z = 9 and the second is from Z = 10 to Z = 14. These represent the starting and ending points where the largest cluster “suddenly” breaks up into much smaller clusters, the largest of which is almost 50% of the “parent”. The size of the largest cluster in the random graph is always much larger than the size of the largest cluster in the real graph up until Z > 12. Because of this we will argue that the third and final non-random transition occurs at around Z = 11. The behavior of the other clusters closely mirrors that of the largest cluster, thus showing that the phase transition is not just a function of the major superfolds but of the majority of the PDUG graph. It is interesting that the first three largest

Figure 6. (a) The size of the largest cluster in FSSP graph plotted against the similarity cutoff threshold at which the graph is clustered. The size of the largest cluster in the random graph is plotted in black as the mean with error bars as one standard deviation from the mean. The sampling was done for 200 random graphs. It is worth noting that the size of the largest cluster in the random graph is larger than the largest cluster in FSSP until the end of the phase transition at Z ¼ 12: This is due to the power-law nature of the FSSP graph.43 (b) The size of the first six largest clusters plotted together as percentage of their original size. The computation was done by ordering the sizes of the clusters at each cutoff and plotting the largest six. The largest six clusters account for vast majority of the domains that are not orphans (singletons). It is worth observing that all the phase transitions occur between Z ¼ 6 and Z ¼ 9:

Quantifying Structure –Function Uncertainty

clusters transition at around Z ¼ 9 while the smaller three transition closer to Z ¼ 6: While we have not explored the exact reasons behind this observation we hypothesize that the larger clusters are either still evolving and “adding” new domains or have evolved more recently while the smaller three have “stagnated” and stopped growing. This observation is a direct consequence of the higher Z value of the transition. The higher value may indicate either that the domains diverged more recently or that their speed of divergence is smaller. As further evidence that the phase transitions are a function of the whole space we plotted the slope of the linear fit of the distribution of cluster sizes on a log – log plot. We know from previous work that the distribution of cluster sizes is a power-law easily fitted with a line on a log – log plot. Thus, by changing the Z cutoff we also change the distribution of the cluster sizes. We present the slope of the fitted line with respect to Z in Supplementary Material. Not surprisingly we observe that there is a transition around Z ¼ 9 indicative of a change and breakup of the graph from large to small clusters. Not coincidentally, the Z scores where all these transitions occur are ones that have the largest overlap with the manually curated SCOP and CATH assignment (Figures 3 and 4). For example, Z ¼ 6 has the greatest overlap with SCOP Fold (cutoff 2) and CATH Topology cutoff 3. Z ¼ 9 has the greatest overlap with SCOP Superfamily (cutoff 3) and CATH Homologous Superfamily (cutoff 4) and Z ¼ 10 is the point of the greatest overlap with SCOP Family annotation (cutoff 4) (equation (4) and Figures 3– 5). Thus, by comparing the graphs of FSSP, SCOP and CATH we were able not only to correlate manually curated, discrete assignments to a completely automatic structural comparison distance but also find an independent justification for the hierarchical annotation scheme used by these databases. 
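The random-graph control described above can be illustrated with a minimal sketch. It assumes that "clustering" a graph at a cutoff means taking its connected components, and that self-loops and duplicate edges are rejected when edges are redistributed; neither detail is specified in this section, and the function names are hypothetical.

```python
import random
import statistics


def largest_component_size(n_nodes, edges):
    """Return the size of the largest connected component (union-find)."""
    parent = list(range(n_nodes))
    size = [1] * n_nodes

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            continue
        if size[root_a] < size[root_b]:
            root_a, root_b = root_b, root_a
        parent[root_b] = root_a
        size[root_a] += size[root_b]
    return max((size[find(v)] for v in range(n_nodes)), default=0)


def random_edges(n_nodes, n_edges, rng):
    """Place n_edges edges between uniformly chosen pairs of distinct nodes.

    Assumption: self-loops and duplicate edges are rejected.
    """
    if n_edges > n_nodes * (n_nodes - 1) // 2:
        raise ValueError("more edges requested than distinct node pairs")
    edges = set()
    while len(edges) < n_edges:
        a, b = rng.sample(range(n_nodes), 2)
        edges.add((min(a, b), max(a, b)))
    return sorted(edges)


def random_control(n_nodes, n_edges, trials=200, seed=0):
    """Mean and (population) standard deviation of the largest cluster size
    over `trials` random graphs with the same node and edge counts."""
    rng = random.Random(seed)
    sizes = [
        largest_component_size(n_nodes, random_edges(n_nodes, n_edges, rng))
        for _ in range(trials)
    ]
    return statistics.mean(sizes), statistics.pstdev(sizes)
```

Calling `random_control(n_nodes, n_edges)` with the node and edge counts of the real graph at one Z cutoff gives one point of the random curve and its error bar; repeating this over all cutoffs would give a curve comparable to the black line in Figure 6(a).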
The fact that this correlation exists and that the FSSP graph behaves in such a non-random fashion probably indicates that protein domain evolution has a “natural” clustering which explicitly defines the hierarchical annotation. At these thresholds all domains inside clusters are much closer to each other than to other domains. This could be because of the manner in which protein domains evolved and is consistent with some recent models describing evolution by punctuated equilibrium48 (Tiana et al., unpublished results).

Finally, we explore the consensus between the automatic assignment of Fold described by Dietmann & Holm using neural nets34 and the hypothetical assignment of Fold from the FSSP graph clustered at Z = 6, corresponding to the transition threshold. To do this, we perform an analysis very similar to that performed for the comparison between FSSP and SCOP and CATH. We take the FSSP graph and cluster at each threshold between 2 and 16 with step 0.5. All domains that fall in the same cluster are then annotated with the same “Fold”. This assignment is then compared to the Fold assignment from the Dali Domain Dictionary obtained from neural net analysis by Dietmann & Holm34 using the methodology described above (equation (3); Figure 7). Using only structural information and clustering at the phase-transition threshold of Z = 6, we were able to reach a higher level of consensus with the annotation of Fold from Dali34 than that between Dali and SCOP, where the Jaccard measure is equal to 0.30, which is analogous to the value for the FSSP comparison at thresholds Z = 6 and SCOP = 2 (Figures 4 and 5), and an even larger consensus than between SCOP and CATH (Figure 2). Thus, we argue that at least the Fold and Superfamily levels of domain annotation have intrinsic origins in the highly uneven distribution of structure space.43 This uneven distribution can be quantitatively observed on the phase transition graph of the size of the largest cluster with respect to Z score cutoff. It is hard to pinpoint the origin of the discrepancy in Fold assignment between the different annotation systems. However, the fact that the quantity of consensus is similar between all systems of comparison may indicate an intrinsic level of uncertainty in our ability to assign Fold classes to some domains.33 Different speeds of divergence for different clusters may also contribute to mis-annotation of some domains into incorrect clusters. However, it could also show an intrinsic level of convergence in the evolution of protein domains.

Figure 7. Jaccard distance between FSSP partitioning at a given threshold and Dali Domain Fold assignment as determined using neural networks by Dietmann & Holm.34 The greatest overlap occurs at Z = 6 with Jaccard = 0.64. This occurs at the same Z score where the first phase transition in the structural graph starts and where the greatest overlap with SCOP Fold annotation occurs, once again indicating the non-arbitrariness and theoretical origin of Fold assignment. The overlap is four orders of magnitude larger than would be expected at random and larger than the consensus of Fold assignment between SCOP and CATH (Jaccard = 0.45; Figure 2).
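As an illustration of how two Fold assignments of the same domains might be compared, the sketch below computes a pair-counting Jaccard coefficient between two partitions. Equation (3) is not reproduced in this part of the text, so the pair-counting form used here is an assumption rather than the paper's exact definition; the function name and the domain identifiers in the usage note are hypothetical.

```python
from itertools import combinations


def jaccard_partitions(labels_a, labels_b):
    """Pair-counting Jaccard coefficient between two partitions.

    labels_a, labels_b: dicts mapping a domain id to its cluster label.
    Only domains present in both assignments are compared.
    Assumed form: J = n11 / (n11 + n10 + n01), where n11 counts pairs
    co-clustered in both partitions and n10, n01 count pairs co-clustered
    in only one of them.
    """
    shared = sorted(set(labels_a) & set(labels_b))
    both = only_one = 0
    for x, y in combinations(shared, 2):
        same_a = labels_a[x] == labels_a[y]
        same_b = labels_b[x] == labels_b[y]
        if same_a and same_b:
            both += 1
        elif same_a or same_b:
            only_one += 1
    total = both + only_one
    return both / total if total else 0.0
```

For example, with `fssp = {"d1": "A", "d2": "A", "d3": "B", "d4": "B"}` and `dali = {"d1": "X", "d2": "X", "d3": "X", "d4": "Y"}`, one pair is co-clustered in both assignments and three pairs in only one, so the coefficient is 0.25. Evaluating the function for the FSSP partition at each threshold from 2 to 16 in steps of 0.5 would trace a curve of the kind shown in Figure 7. The quadratic loop over pairs is chosen for clarity, not speed.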


Quantifying Structure–Function Uncertainty

We continue our analysis by methodically exploring the structure–function relationship. From our previous analysis we were able to get a rough approximation of the structural similarity that yields general functional conservation, at around Z = 9, the Superfamily level of CATH. We now ask how and to what extent structure conservation affects function conservation. We reason that if the structural comparison graph undergoes characteristic transitions related to evolutionary divergence, function may also undergo similar transitions due to the intrinsically similar mechanisms underlying functional divergence.18 What we observed in our previous work was that divergence of structure is accompanied by divergence of function. If we note all the functions that domains in a particular cluster perform, we can use this to define a “fingerprint” of possible functions that a novel domain placed in the same cluster may perform. Thus, if we identify a domain that can be placed reliably in a structural neighborhood, we can also define a probabilistically weighted set of functions that this homologue may be involved in, governed by the functional fingerprint of that cluster.18 We want to quantify this behavior and see what, if any, conclusions we can draw from it about the underlying structure–function correlation.

To approach this problem quantitatively, we introduce the concept of entropy in function space, called the functional flexibility score or FFS (Figure 8). This quantity is a direct extension of the “functional fingerprint” that was introduced in our earlier work18 and is meant to measure the divergence of function for an arbitrary set of sequences. To calculate the functional entropy of a domain, we start by combining all sequences that fold into a particular structure into a set. We then match these sequences to InterPro49 equilogs (sequences with the same function). We reconstruct a gene ontology (GO)50 tree from the annotations of the equilogs and calculate the number of equilogs of the family that are assigned a particular functional annotation, normalized by the total number of annotations at each level (equation (5); Figure 8). Using the formalism of FFS we can calculate the average amount of information per level on the GO tree needed to fully describe the functionality of each domain using the following equation:

$$\mathrm{FFS} = -\frac{1}{\mathrm{Max}(L)} \sum_{l} \sum_{i \in \{\text{nodes on level } l\}} p_i \log(p_i) \tag{5}$$

Here Max(L) is the maximal number of levels of annotation, summation is taken over all levels l and over all nodes i filled by the domain on the GO tree, and p_i is the fraction of the sequences homologous to that domain that are annotated with function i (Figure 8). Larger FFS represents domains whose sequence homologues perform very diverse functions, while low FFS signifies very functionally coherent domains.

Figure 8. The schematic representation of how we calculate FFS. We take all domains that fall into a particular cluster, find all sequences that fold into those domains by stringent sequence comparison with > 25% sequence identity and annotate them with their function using InterPro. We calculate a weighted GO tree from these functions and use that to calculate the FFS using equation (5).

We can use the procedure for calculating FFS for individual domains to calculate the overall functional entropy of a particular clusterization at each cutoff of FSSP using a slight modification to equation (5) (equation (6); Figure 9). To do this, we combine all sequences from each cluster and build a single GO tree for each cluster such that each domain in the cluster adds the same “amount of information” to the final GO tree:

$$\mathrm{FFS}(\mathrm{Cluster}) = -\frac{1}{\mathrm{Max}(L)} \, \frac{1}{N} \sum_{l} \sum_{i \in \{\text{nodes on level } l\}} \sum_{j \in \{1..N\}} p_{ij} \log(p_{ij}) \tag{6}$$

Here, N is the total number of domains in the cluster and p_ij is the fraction of the sequences of domain j that are annotated with function i.
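The FFS of equation (5) and its per-cluster variant can be sketched in a few lines. This is a minimal illustration under stated assumptions: the logarithm is taken to base 2 because the results are quoted in bits, Max(L) is taken to be the number of GO levels supplied in the input, and the data layout (one dictionary of annotation fractions per GO level) and function names are hypothetical.

```python
import math


def ffs(level_fractions):
    """Functional flexibility score of one domain, following equation (5).

    level_fractions: list with one dict per GO level, mapping a GO node to
    the fraction p_i of the domain's sequences annotated with that node.
    Assumptions: log base 2 (bits); Max(L) is the number of levels given.
    """
    max_levels = len(level_fractions)
    if max_levels == 0:
        return 0.0
    total = sum(
        p * math.log2(p)
        for level in level_fractions
        for p in level.values()
        if p > 0
    )
    return -total / max_levels


def ffs_cluster(domains):
    """Cluster-level FFS: the same sum taken over every domain j in the
    cluster and divided by both Max(L) and the number of domains N."""
    n_domains = len(domains)
    max_levels = max((len(d) for d in domains), default=0)
    if n_domains == 0 or max_levels == 0:
        return 0.0
    total = sum(
        p * math.log2(p)
        for domain in domains
        for level in domain
        for p in level.values()
        if p > 0
    )
    return -total / (max_levels * n_domains)
```

A domain whose sequences split evenly between two functions on a single GO level has an FFS of 1 bit, while a domain with a single function has an FFS of 0. A cluster containing one domain of each kind therefore scores 0.5 bits.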
First, we want to explore in general the extent of the structure–function correlation. We calculate the functional entropy of all clusters at every threshold of FSSP from 2 to 16 with step 0.5 and compare that with the functional entropy expected at random for a cluster of that size (Figure 9(a)). To do this, we cluster the FSSP graph at all thresholds and then calculate FFS for each cluster using equation (6). To calculate the FFS expected at random, we choose the corresponding number of domains at random and calculate FFS using equation (6). We clearly see in Figure 9(a) and (b) that there is a dramatic difference between the functional entropy observed by chance and that observed for same-size clusters that share structural similarity. The Kolmogorov–Smirnov test between the two samples shows, with P < 1 × 10⁻⁵⁰, that they come from different distributions (Figure 9(b)). Thus, perhaps not surprisingly, we can “prove” decisively that there is a structure–function correlation in the sense that domains that fall in the same structural cluster require less information to explain their collective function than comparable clusters of domains picked at random. This also means that there is correlated divergence of structure and function. We can also quantify the amount of information about function gained from knowledge of structure; it is on average 1.8 bits per level on the GO tree (Figure 9(b)).

Figure 9. (a) The entropy of random clusters (gray) plotted with the functional entropy of real clusters of FSSP (red), clustered at some structural similarity score, with respect to cluster size. The real clusters were sampled from FSSP graphs clustered at all Z scores from 2 to 16 every 0.5 as described above. The FFS was calculated for all domains in the cluster using equation (6) for both random and real clusters. The best-fit lines show that the entropy of random clusters follows FFS = 0.9(ln X + 1) while real clusters follow FFS = 0.65(ln X + 1), where X is the number of domains in the cluster. The best-fit lines have R² values of 0.996 and 0.8 for random and real clusters, respectively. Random clusters were constructed by randomly picking domains until the size of the cluster was reached. (b) The distribution of FFS scores for random and real clusters. The real clusters are clearly distributed along a much smaller functional entropy, indicating that there is a structure–function relationship. The difference of means is around 1.8 bits, while the KS test yields P < 1 × 10⁻⁵⁰ that the two distributions come from the same underlying distribution.

Next, we set out to quantify the amount of information about function that we gain from structural comparison at each Z threshold. To do this, we calculate the difference between the information needed to explain the functionality of each domain in structural clusters and what we would expect at random for that ensemble of clusters. We take each clusterization at interval Z cutoffs, compare the entropy of each real cluster with the one we would expect at random and normalize by the number of domains in that cluster, thus computing the information gained per domain at that threshold (equation (7); Figure 10(a)):

$$G_z = \sum_{C \in \{1..N\}} \left( \mathrm{FFS}_{RC} - \mathrm{FFS}_{C} \right) \tag{7}$$

Here G_z is the gain in information for that Z threshold, FFS_RC is the functional entropy we would expect at random for a cluster of size |C|, and FFS_C is the entropy observed for that cluster. It is worth noting that the information gain changes drastically only from Z = 6 (the Fold level) to Z = 12, which corresponds roughly to the Superfamily level of comparison. This suggests that the real gain in functional information occurs only between those structure comparison regimes. We note that these are exactly the thresholds where the greatest overlap occurs for proper hierarchical annotation of structure (Figure 6). This finding ties functional information to structural comparison in a quantitative way. Recall that before we applied this formula we observed that Z = 9 corresponds to the Superfamily level in CATH and should be the structural threshold for functional similarity.
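The information gain G_z defined above can be sketched as a sum over clusters of the difference between the entropy expected at random and the entropy observed. In the sketch below, the random expectation is taken from the best-fit line quoted in the Figure 9 caption, FFS = 0.9(ln X + 1); this is a stand-in chosen for illustration, since the analysis itself samples random clusters directly. The function names and the optional normalization argument are hypothetical.

```python
import math


def expected_random_ffs(cluster_size):
    """Random-cluster FFS from the fitted line FFS = 0.9 (ln X + 1).

    Assumption: the fit from the Figure 9 caption is used in place of
    explicitly sampled random clusters.
    """
    return 0.9 * (math.log(cluster_size) + 1.0)


def information_gain(clusters, n_domains=None):
    """Sum of (random FFS - observed FFS) over all clusters at one Z cutoff.

    clusters: iterable of (cluster_size, observed_ffs) pairs.
    n_domains: if given, the total is divided by this number to express
    the gain per domain.
    """
    gain = sum(
        expected_random_ffs(size) - observed for size, observed in clusters
    )
    if n_domains:
        gain /= n_domains
    return gain
```

As a rough check on the fitted lines themselves: for a cluster of 10 domains the random fit gives 0.9(ln 10 + 1) ≈ 2.97 bits and the real-cluster fit gives 0.65(ln 10 + 1) ≈ 2.15 bits, a difference of about 0.83 bits for that cluster.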
From our qualitative argument alone we were able to predict the exact middle of the range for functional information gain. We see that functional information is only gained from the Fold level of annotation to the Family level of annotation.

We employ the same technique to assess the theoretical limit of functional uncertainty at every structural similarity threshold. To do this, we calculate the average FFS value for each clustering of FSSP at a particular Z threshold. This represents the average amount of information that we need to describe each domain if we know the functions of its structural neighbors at some structural similarity threshold, i.e. the functional fingerprint18 of the cluster (Figure 10(b)). This measure represents a limit of structure–function uncertainty, i.e. we cannot infer the function of a domain from structural homology modeling better than allowed by the speed with which structure and function diverge. The argument is that we cannot know exactly the function of the domain except to say that with some probability the function resembles the functionality of all of its structural neighbors.18,51 Thus, we are able to calculate the theoretical limit of our ability to transfer functional annotation based on structural similarity. We can also note that this represents a quantitative measure of the speed of divergence between structure and function (Figure 10(b)).

Figure 10. (a) The FFS gain per domain with respect to structural similarity threshold. The FFS of each cluster is compared to that expected at random for a cluster of that size and added to the gain at that threshold (equations (6) and (7)). The final FFS gain is normalized by the number of domains annotated in the graph. The majority of the functional information is gained from Z = 6 to Z = 11; before and after those thresholds the information content obtained from structural comparison plateaus. Thus, we can quantify the amount of functional information gained by correctly annotating a domain to its Fold as 0.095 bits per domain, while correctly identifying the Superfamily yields around 0.15 bits per domain of functional information. (b) The intrinsic uncertainty with which we can expect annotation of function at a given structural similarity. For example, at Z = 6 (Fold level) the domain function cannot, on average, be annotated more precisely than 1.6 bits per level on the GO tree. Note that there are two plateaus where the FFS does not significantly change with respect to Z score: the first from Z = 5 to Z = 8 and the other from Z = 9 all the way to Z = 11, showing an intrinsic correlation between structure and function at the Fold and Superfamily levels of annotation. This once again confirms the theoretical origins of this annotation by showing the conservation of function at those levels of structural comparison.

To get a sense of what bits on the GO tree represent, we calculate the average conservation of biochemical function that correlates with FFS bit scores, employing a method analogous to the one used by Gerstein and co-workers in their previous work.14 We use the well-established system of EC nomenclature to calculate the biochemical similarity between pairs of domains in every cluster of FSSP. The EC system describes the functionality of enzymes using a hierarchical system. A pair can share:
(i) No functional similarity when the first levels of EC annotation do not match.
(ii) General similarity when the first level describing reaction types match. For example, types include oxidoreductases or transferases.
(iii) Precise functional similarity when there is a match down to the third level describing substrate specificity, e.g. 2.1.4 amidinotransferases or 3.1.4 phosphoric diester hydrolases.
(iv) Exact functional similarity when the two domains share the exact function down to the fourth level of EC description, like 1.5.1.7 saccharopine dehydrogenase or lysine-2-oxoglutarate reductase.

Using this system we calculate the percentage of pairs that share a functional category for every cluster with respect to the FFS score of those clusters. To do this, we take all domains that are involved in enzymatic function (1269 domains), cluster that subgraph of FSSP as described above (Figure 1) and sample clusters at interval thresholds.
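The four EC comparison categories listed above can be expressed as a small classifier over pairs of EC numbers. The sketch below is an illustration only: it assumes that a pair matching on the first or first two levels counts as "general" similarity, and that a "-" placeholder in a partial EC number never counts as a match. The function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

CATEGORIES = ("none", "general", "precise", "exact")


def ec_similarity(ec_a, ec_b):
    """Classify a pair of EC numbers such as "1.5.1.7" by shared depth."""
    shared = 0
    for level_a, level_b in zip(ec_a.split("."), ec_b.split(".")):
        if level_a != level_b or level_a == "-":
            break
        shared += 1
    if shared == 0:
        return "none"      # first level differs
    if shared < 3:
        return "general"   # same reaction type (assumed to include 2 levels)
    if shared == 3:
        return "precise"   # match down to the third level
    return "exact"         # match down to the fourth level


def category_fractions(ec_numbers):
    """Fraction of domain pairs in a cluster falling into each category."""
    counts = Counter(
        ec_similarity(a, b) for a, b in combinations(ec_numbers, 2)
    )
    n_pairs = sum(counts.values())
    if n_pairs == 0:
        return {category: 0.0 for category in CATEGORIES}
    return {category: counts[category] / n_pairs for category in CATEGORIES}
```

For a cluster with EC numbers `["1.5.1.7", "1.5.1.7", "3.1.4.1"]`, one of the three pairs is exact and two share nothing, so `category_fractions` reports one third "exact" and two thirds "none". Averaging such fractions over clusters grouped by FFS would give a plot of the kind described for Figure 11(b).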
We then note the FFS value of each cluster and the percentage of pairs that fall into the EC comparison categories outlined above. We then average the percentage of pairs in each category for ten FFS bins from 0.5 to 3 with step 0.25 (Figure 11). The results are not unexpected but illuminating. As the FFS score goes up (there is less functional entropy inside the cluster), the percentage of pairs that share large functional similarity (precise and exact) increases, while the percentage of those that share no functional similarity or only general functional similarity decreases (Figure 11(b)). At FFS = −3, the average percentage of pairs that share no or only general functional similarity is around 80, while at FFS = −0.5 the same percentage falls to around 20. This calculation gives a more intuitive sense of functional conservation for each FFS bit score. In terms of uncertainty, we can express the principle outlined above with respect to the percentage of different classes of annotation similarity. For example, if the novel domain is correctly annotated with a Fold, then the uncertainty is around 1.6 bits of information on the GO tree (Figure 10(b)), which means that on average we can expect this domain to have no functional similarity with 45% of the domains in the same fold, general functional similarity with 20%, precise similarity with 20% and exactly the same function as another 15% of the domains annotated with the same fold (Figure 11(a) and (b)).

During the course of our investigation of the quantitative description of structure–function divergence, we were able to quantify the amount of functional information that we gain from understanding structural similarity in general. In particular, we were also able to quantify the average amount of information gained per domain with respect to a given structural similarity Z score and the level of uncertainty that is intrinsic in functional annotation from structural homology modeling. This level of uncertainty is a measure of the average divergence of structure and function. While these calculations are coarse and exhibit average trends, they are useful in determining the information and uncertainty that we can expect from the application of computational methods to large annotation projects such as that of newly sequenced, whole genomes.

Figure 11. (a) A schematic representation of the EC hierarchical description system. Domains that share no common level with each other have no functional similarity; domains that share the first level, e.g. transferase, share the same functional class and are said to have general functional similarity; domains that share the bottom two levels are said to have precise and exact functional similarity, respectively. (b) The plot of functional similarity of pairs inside a cluster with respect to the FFS of that cluster. The percentage is calculated as an average over all clusters that fall into that FFS bin. The binning is done from FFS = −0.5 to FFS = −3 with step 0.25. Each bin has between 20 and 50 clusters represented. We can see that as the FFS value increases (there is less functional entropy in the cluster), the percentage of domains with higher functional similarity to each other, sharing the bottom two levels of the EC hierarchy, also increases with respect to the percentage of pairs that have no or little functional similarity. We can extrapolate this result to infer the distribution of functional similarity between each domain and any other in a cluster with a particular FFS score.

Discussion

Here, we address several of the very important problems facing the burgeoning field of structural genomics. First, we present a generalized method of comparing databases that use different methods to describe similar data. Comparison using graph theoretic methods yields insights into how best to correlate the scoring systems of one dataset with another and the extent to which the two databases agree. Most importantly, the use of graph theoretical methods to compare different databases enables random control, which in turn enables calculations of significance. By clustering graphs and using a generalized distance measure to assess the consensus between them, we can find the pair of cutoffs that yields the best overlap between two databases. This method can be used with slight modification in further research on structure–function relationships, the evolution of metabolic networks and other problems requiring the comparison of different comparison schemes for the same data type.

Using the above methods, we were able to quantify the consensus between the most widely used databases describing protein domain evolution. By comparing the continuous, purely automatic structure comparison measure of FSSP to the manual, discrete annotation of SCOP and CATH, we were able to define the level of structural similarity that best corresponds to the manually defined, hierarchical annotation describing the Class, Family and Superfamily levels of protein domains. Since Superfamily usually conserves function, we were able to describe qualitatively a level of structural similarity (Z = 9) that conserves function. Further, we go on to show that the hierarchical annotation is a product of the highly non-random behavior and clustering of the structure comparison graph itself. Thus, we presented evidence that the choice of the schema of annotation is probably the result of the intrinsic properties of the protein domain structure space. The phase transitions in the automatic structure-comparison graph correspond exactly to the best consensus overlap with both the manually curated annotation and the automatic assignment of Fold by neural net analysis.

Using this and other evidence we have been able to provide a quantitative argument for the theoretical origins of the Fold, Superfamily and Family annotation system used for protein domains. There has been some previous work in the field on domain structure database comparison.31,32 Most recently, Daggett and co-workers compared the fold annotation between SCOP, CATH and the Dali Domain Dictionary.31 It is hard to compare our work in this area to the robustness of their results because that work lacks a null model and thus a measure of confidence. However, most importantly, this and other comparable studies32 have only explored the correlation between discrete annotation systems such as SCOP, CATH and Dali Fold assignments.31 One work by Hadley & Jones sampled automatic structure comparison at different cutoffs32 but never fully explored the origins or justifications for the discrete annotation of SCOP and CATH by rigorous graph theoretical treatment or with random controls as is documented here. Perhaps the paper closest in approach to the one used here is that of Getz, Domany and co-workers,33 where they assign SCOP and CATH Fold by using data from the automatic structure comparison of FSSP. However, Getz et al. dealt with the issue of proper automatic assignment and never fully explored the comparison between the databases or the origins of the discrete annotation system explicitly. Perhaps the earliest work in the area of automatic Fold assignment is the work by Dietmann & Holm,34 where they employed neural network techniques for automatic annotation of Fold based on Dali-generated structure comparison scores.

Here, we attempt to define simple characteristics in the structure comparison graph that enable Fold assignment without the use of multi-input data such as function or the complex machinery of neural nets. We show simply that successful Fold assignment can be achieved with a high rate of success based on clustering of protein domain structures at a particular Z threshold where the phase transition occurs in the size of the largest cluster of the structural comparison graph. To improve on this further, each structural cluster would be assigned its own threshold and clustering would be done at variable thresholds depending on the particular cluster. This approach can be implemented to further improve on the results presented here. Following the results of our previous work,18 we go on to investigate the connection between our findings in the structural realm and the functionality of protein domains. Since the divergence of structure closely mirrors the divergence of function, we attempt to correlate the speeds of these divergences.
To do this quantitatively, we introduce the concept of entropy in function space. This measure is meant to approximate the divergence of function for a set of sequences and is similar in spirit to the functional fingerprints that were introduced earlier. We use the hierarchical annotation system of GO to calculate the functional flexibility score for sets of sequences that fold into domains and generalize to sets of structurally similar domains. Using this paradigm, we were able to calculate the average functional information obtained from proper calculations of structural similarity. We accomplished this by comparing the FFS of structurally similar clusters of domains to the FFS we would expect at random for a cluster of that size. We found that on average 1.8 bits of information are gained from our understanding of structural similarity and proper assignment of domains into their structural clusters across all levels of the annotation hierarchy. We extend this analysis to calculate the average amount of information gained per domain from structural comparison with respect to the similarity Z score. We found that the majority of functional information gain was between the Fold and the Superfamily levels of annotation. This construction also enabled us to calculate the average uncertainty of functional assignment with respect to structural annotation. The origin of this uncertainty lies in the speed of divergence of structure and function and the inherent ambiguity in determining the closest structural “homologue”. Thus, the annotation of function for a novel protein domain that is correctly classified as having a structure comparable to some Fold is subject to the proximal functionality of all domains in that Fold. This uncertainty measure can be very important in assessing the level of assuredness of annotation of large datasets such as whole genomes. To understand the biochemical implications of uncertainty as measured by FFS, we compared the bit scores to the average conservation between pairs of domains sharing some functionality as annotated by the EC system. Using this, we were able to quantify the number of pairs sharing different precision of functional specificity with each other with respect to the FFS score in the cluster.

Acknowledgements

The authors thank Eugene Shakhnovich for illuminating discussions and advice, Shoshana Wodak for critical reading and Charles DeLisi for critical reading and support.

References

1. Chirpich, T. P. (1975). Rates of protein evolution: a function of amino acid composition. Science, 188, 1022–1023.
2. McLachlan, A. D. (1980). Repeated folding pattern in copper–zinc superoxide dismutase. Nature, 285, 267–268.
3. Inana, G., Piatigorsky, J., Norman, B., Slingsby, C. & Blundell, T. (1983). Gene and protein structure of a beta-crystallin polypeptide in murine lens: relationship of exons and structural motifs. Nature, 302, 310–315.
4. Bork, P. & Doolittle, R. F. (1992). Proposed acquisition of an animal protein domain by bacteria. Proc. Natl Acad. Sci. USA, 89, 8990–8994.
5. Reardon, D. & Farber, G. K. (1995). The structure and evolution of alpha/beta barrel proteins. FASEB J. 9, 497–503.
6. Bork, P., Gellerich, J., Groth, H., Hooft, R. & Martin, F. (1995). Divergent evolution of a beta/alpha-barrel subclass: detection of numerous phosphate-binding sites by motif search. Protein Sci. 4, 268–274.
7. Gilbert, W. (1978). Why genes in pieces? Nature, 271, 501.
8. Go, M. (1983). Modular structural units, exons, and function in chicken lysozyme. Proc. Natl Acad. Sci. USA, 80, 1964–1968.


9. Grishin, N. V. (2001). Fold change in evolution of protein structures. J. Struct. Biol. 134, 167–185.
10. Thompson, T. B., Garrett, J. B., Taylor, E. A., Meganathan, R., Gerlt, J. A. & Rayment, I. (2000). Evolution of enzymatic activity in the enolase superfamily: structure of o-succinylbenzoate synthase from Escherichia coli in complex with Mg2+ and o-succinylbenzoate. Biochemistry, 39, 10662–10676.
11. Janecek, S., Svensson, B. & Henrissat, B. (1997). Domain evolution in the alpha-amylase family. J. Mol. Evol. 45, 322–331.
12. Ikeo, K., Takahashi, K. & Gojobori, T. (1995). Different evolutionary histories of kringle and protease domains in serine proteases: a typical example of domain evolution. J. Mol. Evol. 40, 331–336.
13. Ponting, C. P. & Dickens, N. J. (2001). Genome cartography through domain annotation. Genome Biol. 2 Comment 2006.
14. Wilson, C. A., Kreychman, J. & Gerstein, M. (2000). Assessing annotation transfer for genomics: quantifying the relations between protein sequence, structure and function through traditional and probabilistic scores. J. Mol. Biol. 297, 233–249.
15. Ison, J. C. (2000). Exploring protein domain structure. Brief Bioinformatics, 1, 305–312.
16. Orengo, C. A., Bray, J. E., Buchan, D. W., Harrison, A., Lee, D., Pearl, F. M. et al. (2002). The CATH protein family database: a resource for structural and functional annotation of genomes. Proteomics, 2, 11–21.
17. Kinch, L. N. & Grishin, N. V. (2002). Evolution of protein structures and functions. Curr. Opin. Struct. Biol. 12, 400–408.
18. Shakhnovich, B. E., Dokholyan, N. V., DeLisi, C. & Shakhnovich, E. I. (2003). Functional fingerprints of folds: evidence for correlated structure–function evolution. J. Mol. Biol. 326, 1–9.
19. Chothia, C., Gough, J., Vogel, C. & Teichmann, S. A. (2003). Evolution of the protein repertoire. Science, 300, 1701–1703.
20. Thornton, J. M., Orengo, C. A., Todd, A. E. & Pearl, F. M. (1999). Protein folds, functions and evolution. J. Mol. Biol. 293, 333–342.
21. Todd, A. E., Orengo, C. A. & Thornton, J. M. (2001). Evolution of function in protein superfamilies, from a structural perspective. J. Mol. Biol. 307, 1113–1143.
22. Galperin, M. Y. & Koonin, E. V. (1998). Sources of systematic error in functional annotation of genomes: domain rearrangement, non-orthologous gene displacement and operon disruption. In Silico Biol. 1, 55–67.
23. Devos, D. & Valencia, A. (2001). Intrinsic errors in genome annotation. Trends Genet. 17, 429–431.
24. Dokholyan, N. V. & Shakhnovich, E. I. (2001). Understanding hierarchical protein evolution from first principles. J. Mol. Biol. 312, 289–307.
25. Doolittle, R. F. (1994). Convergent evolution: the need to be explicit. Trends Biochem. Sci. 19, 15–18.
26. Ponting, C. P. & Russell, R. R. (2002). The natural history of protein domains. Annu. Rev. Biophys. Biomol. Struct. 31, 45–71.
27. Orengo, C. A., Michie, A. D., Jones, S., Jones, D. T., Swindells, M. B. & Thornton, J. M. (1997). CATH—a hierarchic classification of protein domain structures. Structure, 5, 1093–1108.
28. Murzin, A. G., Brenner, S. E., Hubbard, T. & Chothia, C. (1995). SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247, 536–540.

29. Holm, L. (1998). Unification of protein families. Curr. Opin. Struct. Biol. 8, 372–379.
30. Elofsson, A. & Sonnhammer, E. L. (1999). A comparison of sequence and structure protein domain families as a basis for structural genomics. Bioinformatics, 15, 480–500.
31. Day, R., Beck, D. A., Armen, R. S. & Daggett, V. (2003). A consensus view of fold space: combining SCOP, CATH, and the Dali Domain Dictionary. Protein Sci. 12, 2150–2160.
32. Hadley, C. & Jones, D. T. (1999). A systematic comparison of protein structure classifications: SCOP, CATH and FSSP. Struct. Fold. Des. 7, 1099–1112.
33. Getz, G., Vendruscolo, M., Sachs, D. & Domany, E. (2002). Automated assignment of SCOP and CATH protein structure classifications from FSSP scores. Proteins: Struct. Funct. Genet. 46, 405–415.
34. Dietmann, S., Park, J., Notredame, C., Heger, A., Lappe, M. & Holm, L. (2001). A fully automatic evolutionary classification of protein folds: Dali Domain Dictionary version 3. Nucl. Acids Res. 29, 55–57.
35. Lander, E. S., Linton, L. M., Birren, B., Nusbaum, C., Zody, M. C., Baldwin, J. et al. (2001). Initial sequencing and analysis of the human genome. Nature, 409, 860–921.
36. Buchan, D. W., Shepherd, A. J., Lee, D., Pearl, F. M., Rison, S. C., Thornton, J. M. & Orengo, C. A. (2002). Gene3D: structural assignment for whole genes and genomes using the CATH domain structure database. Genome Res. 12, 503–514.
37. Laskowski, R. A., Watson, J. D. & Thornton, J. M. (2003). From protein structure to biochemical function? J. Struct. Funct. Genomics, 4, 167–177.
38. Lo Conte, L., Brenner, S. E., Hubbard, T. J., Chothia, C. & Murzin, A. G. (2002). SCOP database in 2002: refinements accommodate structural genomics. Nucl. Acids Res. 30, 264–267.
39. Holm, L. & Sander, C. (1997). Dali/FSSP classification of three-dimensional protein folds. Nucl. Acids Res. 25, 231–234.
40. Holm, L. & Sander, C. (1998). Touring protein fold space with Dali/FSSP. Nucl. Acids Res. 26, 316–319.
41. Gough, J. & Chothia, C. (2002). SUPERFAMILY: HMMs representing all proteins of known structure. SCOP sequence searches, alignments and genome assignments. Nucl. Acids Res. 30, 268–272.
42. Holm, L. & Sander, C. (1995). Dali: a network tool for protein structure comparison. Trends Biochem. Sci. 20, 478–480.
43. Dokholyan, N. V., Shakhnovich, B. & Shakhnovich, E. I. (2002). Expanding protein universe and its origin from the biological Big Bang. Proc. Natl Acad. Sci. USA, 99, 14132–14136.
44. Hartigan, J. A. (1975). Clustering Algorithms, Wiley, New York.
45. Newman, M. E., Watts, D. J. & Strogatz, S. H. (2002). Random graph models of social networks. Proc. Natl Acad. Sci. USA, 99, 2566–2572.
46. Callaway, D. S., Hopcroft, J. E., Kleinberg, J. M., Newman, M. E. & Strogatz, S. H. (2001). Are randomly grown graphs really random? Phys. Rev. E Stat. Nonlin. Soft Matter Phys. 64, 041902.
47. Newman, M. E., Strogatz, S. H. & Watts, D. J. (2001). Random graphs with arbitrary degree distributions and their applications. Phys. Rev. E Stat. Nonlin. Soft Matter Phys. 64, 026118.
48. Perlovsky, L. I. (2002). Statistical limitations on molecular evolution. J. Biomol. Struct. Dyn. 19, 1031–1043.
49. Apweiler, R., Attwood, T. K., Bairoch, A., Bateman, A., Birney, E., Biswas, M. et al. (2000). InterPro—an integrated documentation resource for protein families, domains and functional sites. Bioinformatics, 16, 1145–1150.
50. Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M. et al. (2000). Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nature Genet. 25, 25–29.
51. Shakhnovich, B. E., Harvey, J. M., Comeau, S., Lorenz, D., DeLisi, C. & Shakhnovich, E. (2003). ELISA: structure–function inferences based on statistically significant and evolutionarily inspired observations. BMC Bioinformatics, 4, 34.

Edited by J. Thornton

(Received 20 October 2003; received in revised form 13 January 2004; accepted 3 February 2004)

Supplementary Material for this paper, comprising one Figure and its legend, is available on ScienceDirect.