S.L. Salzberg, D.B. Searls, S. Kasif (Eds.), Computational Methods in Molecular Biology
0 1998 Elsevier Science B.V. All rights reserved CHAPTER 1 1
Statistical analysis of protein structures Using environmental features for multiple purposes Liping Wei, Jeffrey T. Chang and Russ B. Altman Section on Medical Informatics, Department of Medicine, Stanford University School of Medicine, Stanford, CA 94305-5479, USA. Phone: +I 650-723-6979; Fax: +I 650-725-3394; Email: (wei, jchang, altman)@smi.stanford.edu
List of Abbreviations 3D
three-dimensional
PDB
ATP
adenosine triphospbate
LPFC
Library of Protein Family Cores
SCWRL Side Chain placement With a Rotamer Library
Protein Data Bank
NMR
nuclear magnetic resonance
1. Protein structures in three dimensions In its native state, a protein folds into a beautiful, three-dimensional (3D) structure [ 11. Generally, the amino acid sequence of a protein can determine its three-dimensional conformation, which in turn determines the protein’s functions. However, the precise manner in which sequence determines structure and structure determines function is not always clear. The 3D structures of proteins are, for the most part, determined experimentally by X-ray crystallography or nuclear magnetic resonance (Nh4R). In recent years, the number of protein structures determined experimentally and deposited in structural databases such as the Protein Data Bank (PDB) [2] has been increasing exponentially. However, the acquisition of this structural information is still a slow and expensive process, and the structural information is extremely valuable. For these reasons, careful scrutiny of the available structures to extract a maximum of useful information is critical. In the PDB, many of the known protein structures belong to the same structural or functional families. This offers us the opportunity to statistically study multiple structures with the same general three-dimensional shape (also called the “fold”) or the same function. The advantage of statistical analysis on a large set of structures, rather than caseby-case analysis, is that it becomes possible to distinguish general, conserved features from individual idiosyncratic ones, to discover patterns that could only be revealed by pooled data, and to use these patterns for analysis of new sequences and structures. This chapter is the first on topics related to protein 3D structures. It outlines a statistical approach to collectively analyzing protein structures, which are represented with physical and chemical properties defined at multiple levels of detail. As we will show, the statistical descriptions of protein families contain rich biochemical information and can be used to characterize the molecular environments within protein structures, recognize functional 207
208
sites in new proteins, compare the environments around amino acids, and predict some protein 3D structures from their sequences. In particular, we have implemented a system called FEATURE that can be used to describe 3D motifs within protein families. FEATURE has been used to detect sites within protein structures - such as binding sites for cations (calcium, magnesium, sodium) and small molecules such as ATE! FEATURE has also been used to compare the average environments around the twenty amino acids, in order both to understand the different chemical milieu in which they exist, and to create a similarity matrix for performing protein sequence alignments. There are many ways to represent a protein structure. The most basic description includes the set of atoms in the structure, their three-dimensional coordinates, and their connectivity. Other useful representations can be a specification of the secondary structural segments (helices, @-strands,and loops), a plot of all the distances between amino acids, or a computation of the electrostatic field generated by the atoms in the protein. The system we will discuss in this chapter, called FEATURE, represents protein structures using the spatial distributions of a variety of physical and chemical properties [3]. When multiple structures from the same structural family or with the same functional site are aligned, one can apply statistical analysis to the ensemble of all structures and generate a statistical description of the family or site. These statistical descriptions are useful for a variety of applications discussed later in this chapter. Due to the rapidity of gene sequencing and the increasingly successful efforts of the genome projects, thousands of DNA and protein sequences are determined every year. Each of the protein sequences must be associated with a function and, ideally, a 3D structure in order to fully understand the basic molecular biology of the organisms from which they are obtained. However, only a small number of these protein 3D structures can be determined by the traditional experimental methods which are both time- and labor-intensive. The full potential of the genome projects to catalyze the biotechnology, pharmaceutical, and chemical engineering industries depend on methods for rapid and accurate analysis of gene sequences. Thus, as the gene sequences come out of the automated sequencing machines and are stored in large databases, the problem of assigning structure and function becomes a fundamentally computational one. The most basic approach relies on aligning a new sequence with sequences whose structure or function is already known. In addition to seeking alignment of identical amino acids, these methods allow a substitution matrix to be used which gives partial match credit to amino acids that are in some way similar. This approach works well when the sequence identity (the number of exact matches in a high-quality alignment) between the two sequences is reasonably high (for example, greater than 25%). However, two proteins with sequence identity less than 20% can adopt the same fold, and simple sequence comparison approaches may fail to detect similarity. One way to address this problem is to test the compatibility of a new protein sequence with a set of previously determined folds. This is often called the “threading”, or “fold recognition”, approach to protein structure prediction [&6]. It is based on the assumption that the side chains of an amino acid chain, folded into their native conformations, form energetically favorable interactions. Thus, when a sequence is mounted upon a backbone similar to its own, its side chains will form favorable local stabilizing interactions. Threading may be more sensitive than sequence
209
alignment because it models interactions among amino acids, and not just sequential similarity. Statistical descriptions of protein structures are useful in both of these structure prediction approaches. They can be used to construct amino acid similarity matrices for sequence comparison [7]. When sequence similarity is low, the same statistical descriptions can be used as the basis for threading methods. Section 2 is a tutorial section that introduces the concepts underlying statistical comparison of two distributions, classification methods and evaluation, conditional probability, and Bayes’ Rule. We provide the basic review in order to make the chapter relatively self-contained. More detailed information about statistical inference techniques can be found in refs. [8,9]. Readers familiar with the material can go directly to section 3 , which discusses a method for representing and statistically analyzing protein structures. Section 4 includes four applications of the statistical approach. The chapter concludes with a brief summary.
2. Statistics, probabiliQ, and machine learning concepts Determining the key features that distinguish two populations is a common statistical problem. Given all the properties that describe members from each population, which are most important for distinguishing between the populations? Given the set of distinguishing features, how should we classify a new individual into one of the populations, based on the attributes of that individual? The first task is called the inference problem, and the second is known as the classification problem. This section first introduces statistical inference methods, with emphasis on the Manr-Whitney Rank-sum test. It then discusses the classification problem and the evaluation of classification algorithms. There is an important class of algorithms that are based on Bayes’ Rule. Section 2.3 will introduce Bayes’ Rule and the notion of conditional probability.
2.1. Statistical inference from two sets of data Quite often we are faced with inference problems involving comparison of two probability distributions based on two sets of data samples, with each set being randomly sampled from one of the distributions. For example, we often need to determine whether the mean values of two distributions are the same, based on our data samples. Many statistical methods have been developed to solve this problem, and they can be roughly divided into two categories: parametric and nonparametric [9]. Parametric methods, such as the z-test or t-test, should be used when both distributions are, or are close to, Gaussian normal. As long as the assumption of Gaussian normality is valid, the parametric methods are more powerful in that they are more likely to detect small differences when they exists. However, if either of the distributions is non-Gaussian, nonparametric methods must be used. The Mant-Whitney Rank-sum test is a nonparametric test that compares two distributions with roughly the same shape, which need not be Gaussian. All the data points in the two sets are ordered (or ranked) by their value. If the two distributions are
210
roughly equivalent, then the average rank of members originally from the first set should be about the same as the average rank of data points from the second set. Moreover, the data points from the two data sets should be mixed together evenly throughout the ordered set of combined data. If one distribution has a higher mean value than the other, then the average rank of its members will be lower than that of the members of the other distribution. The average rank for data from each set can be computed. Let T be the sum of the ranks for a data set with nl members when compared to a data set with n2 members. When both data sets are drawn from the same distribution, and the larger of the data sets contains more than 8 members, the statistic T approaches the normal distribution with mean ( p ~ and ) variance (or2) given by
Thus, the z-statistic,
z = -T - P T oT
’
measures the difference between the two distributions, and also has a standardized normal distribution. The statistical significance of z, the p-level (ranging from 0 to l), can be looked up in any statistics books or widely available software. The p-level is the probability that the observed difference is due to chance alone. Thus, the lower the p-level, the more likely it is that the observed difference is significant. The implementation of the Rank-sum test is somewhat more complicated when there are ties in the values of the data set (hence multiple data points with the same rank) and is beyond the scope of this discussion. For more details on the Rank-sum test and other statistical methods, see ref. [9]. Nevertheless, this test (also known as the Wilcoxon Rank test) is good at detecting difference in nonparametric distributions and can work with a fairly low sample size. We show one application of this test in section 4.
2.2. Classijication problems and evaluation of classiJcation algorithms The problem of classification is that of assigning a new individual into one of a set of possible classes or populations on the basis of a set of features. Classification methods can be divided into two types: supervised classification and unsupervised classification. Unsupervised classification (also referred to as clustering methods) is used when the classes are not known ahead of time, and so the classes are defined by looking at individuals and grouping individuals together into clusters. Supervised classification is used when the classes have already been specified, and the task is to determine which features of the classes should be used to evaluate and assign a new unlabeled individual to a class. One example of a supervised classification problem is the task of distinguishing protein structures that bind calcium ions from those that cannot bind calcium. Proteins that bind calcium ions will have different structural and biochemical features than those that do not. Now, if given a new 3D protein structure, how can we predict if it will bind calcium? We look at its structural and biochemical features, compare them to those in
211
the examples we have seen before, and decide whether to “classify” the region in the new structure as a calcium binding site or not. There are many classification algorithms that address the classification problem by statistics, probability, or machine learning methods. Most of them infer a model (or mapping function) that relates the features with the categories based on the training data. The model will be used to classify new, previously unseen instances. A perfectly accurate algorithm should be able to classify each instance into its proper category. The accuracy of a classification algorithm can be defined by two criteria: sensitivity and specificity, defined by TP TN sensitivity = ___ specificity = TP + FP’ TN + FN’ where TP is the number of true positives, FP is the number of false positives, TN is the number of true negatives, and FN is the number of false negatives. For our calcium example, “sensitivity” measures the ability to detect when there really is a calcium binding site, and “specificity” measures the ability to reject regions that are not calcium sites. When a model is built from a set of training data, it is not unusual for the model to “match” or “explain” the training data better than it explains other previously unseen sets of data. Very often models contain subtle sources of bias that make them perform better on the training data sets than they do on test data sets. Therefore, it is important to calculate sensitivity and specificity on a test data set not used in constructing the classification model. Otherwise the model may only memorize and recognize examples it was given, but may not have extracted the more general features necessary for classification of previously unseen examples. When a model is built that performs very well on the data used to construct it, but poorly on equivalent data not used in the construction, we say that the model “ovefits” the data. Instead, a better approach is to calculate the sensitivity and specificity of the algorithm on a separate, previously unseen, test data set. It is common to divide the whole data set into a training set and a test set. Then, hiding the test set, construct a classification model using only the training set. Finally, calculate the sensitivity and specificity of the classification model on the test set. Although this trainand-test approach avoids the ovefitting problem, it has other potential problems. First, the estimated accuracy may be influenced by the different ways of dividing the whole data set into training and test set if this division is in some way biased. Second, in some cases we have so little available data that we cannot afford to divide them. One solution to building and evaluating models with sparse data sets is to estimate accuracy using cross-validation. The complete data set is divided into n subgroups. For each subgroup, a classification model is built using the remaining n- 1 subgroups. The model is then tested on the subgroup that was left out in order to measure its accuracy. This procedure is repeated for every subgroup. The overall accuracy will be the averaged accuracy from the n steps. Cross-validation has the advantages of being unbiased and making full use of the available data. 2.3. Conditional probability and Bayes ’ Rule
Some classification algorithms are based on Bayes’ Rule. Before we introduce Bayes’ Rule, we need to understand the principles of basic probability. Let the probability of an
212
event S be P(S), and the probability of another event E be P(E). Denote the probability that S and E both occur, which is often called the joint probability, by P(SE). The complement of event S is the event that S does not occur, and is often written as 3. We have
P(3) = 1 -P(S). The probability of event S , given the fact that event E has already occurred, is called the conditional probability of S given E, and is written as P(S1E). For example, suppose we are interested in finding out if a site in a protein structure is a calcium binding site. Let S be the event of the site being a calcium binding site. Let E be the event of the site having five oxygen atoms in the shell about 2 A away from the site center. Now P(SIE) is the conditional probability of seeing a calcium binding site given that we have already seen five oxygen atoms in the 2 A shell. Knowing that there are many oxygen atoms around the site makes the probability of its being a calcium binding site much greater. Given new evidence about the abundance or deficit of oxygen atoms, our estimate of the possibility of seeing a calcium binding site changes accordingly. Bayes’ Rule can be used to update conditional probabilities of events when given new evidence [8]. The “updated” conditional probabilities are useful in decision making. For one event and one piece of evidence, Bayes’ Rule gives
The goal is to find the posterior probability P(SJE).Unconditioned probability P ( S ) , often referred to as prior probability, needs to be specified from experience or previous data. The conditional probabilities of the evidence given different states of the event, P(E IS) and P(E also need to be provided. Prior and posterior refer to probabilities before and after observing the evidence E. Bayes’ Rule gives a simple method for computing the new probabilities of an event based on our previous best estimate and the information contained in the new evidence.
Is),
3. Statistical description of protein structures In the FEATURE system, we represent proteins using the spatial distributions of critical properties. A comprehensive set of chemical and physical properties are used to describe structures at multiple levels of detail. The properties can be grouped roughly into categories based on whether they depend on the identity of the atom, the chemical groups to which they belong, the amino acids to which they belong, the secondary structures to which they belong, or some other properties that can be computed knowing the protein structure. A summary list of one set of properties is shown in Table 1. This list is representative, and the detailed list of the properties and their definitions can be found in ref. [3]. Other properties of interest may exist, and they may certainly be added. Our FEATURE method starts with a 3D protein structure as reported in the Protein Data Bank (PDB [ 2 ] ) .A three-dimensional grid keeps track of the locations of the properties
213
Table 1 A summary list of chemical and physical properties used by FEATURE Atom-based properties Residue-based properties Chemical group-based properties Secondary structure-based properties Other properties
Atom types; hydrophobicity; charge; positive charge; negative charge Residue types; hydrophobicity classification Hydroxyl; amide; mine; carbonyl; ring system; peptide Secondary structure classification VDW-volume; B-factor; mobility; solvent accessibility
of the protein’s atoms. The grid is cubic, with a unit-cell diagonal chosen to be the length of a carbon-oxygen single bond, so that two nearby atoms rarely occupy the same cell. The axes of the grid are determined by the coordinate system specified in the PDB file. Each grid box stores the values of the properties for the atom(s) it contains. Very often, a small region of the protein structure (sometimes called a site) is of particular interest. A site is a microenvironment within a structure, distinguished by a structural or functional role. A site includes the whole local neighborhood around the center of interest within a certain distance (lOA, for example). (The dimensions of medium-sized proteins are 30 to loo& on average.) In order to study the spatial distribution of the properties within a site, the FEATURE system combines the grid cells into bigger spatial volumes and sums the property values in the grid cells that fall within the volume. (The individual grid cells are too small to analyze for significant occurrence of properties.) For example, FEATURE can combine grid cells into radial shells around the central point of interest. It can also collect grid cells into oriented volumes to preserve detailed spatial relationships [ 101. Whatever the strategy for dividing space into smaller volumes, the final product from FEATURE is a set of summed property values, one value for every property at every volume. To take advantage of the increasing number of protein structures available and, more importantly, to avoid characterizingthe idiosyncrasies of a single protein, one can analyze multiple, aligned sites with the same structural or functional characteristic, and compare them to a set of background “control” nonsites - microenvironments that do not have the specific structural or functional characteristic. For each property, at each spatial volume, there are a list of values from the sites, and a list of values from the nonsites. From these, we have the basic lists of numbers for sites and nonsites that form distributions that can be analyzed with various statistical approaches for different purposes.
4. Applications of statistical analysis of protein structures What is the value in collecting these statistics on the microenvironments of protein structures? First, one can characterize the key features of protein sites with important structural and functional roles. Then, the characterizations can be used to recognize sites in new protein structures that have not been analyzed. The description is also useful in protein structure prediction, first by constructing substitution matrices based on a comparison of amino acid environments, and second by recognizing the compatibility of protein structure environments with sequences using threading methods.
214
4.1. Characterizing the microenvironmentsurrounding protein sites
The first use of FEATURE [3] is simply to assist in the analysis of protein structure and function, in order to identify the special features of a structural or functional site, by comparing examples of sites with examples of control nonsites. For this purpose, FEATURE looks at corresponding volumes in sites and nonsites and reports those volumes in which there is a significant difference in the abundance of particular properties. Structures with common shape or h c t i o n should share certain features in their physical layout, and these should be discerned by comparison with other structures which lack these shared features. As discussed in section 3, for each site and nonsite, the property values are computed and stored in the grid. A common strategy is to divide the space of interest into radial shells of thickness 1 8 and sum the property values within each shell. In order to determine if a property within a shell has significantly different distributions for sites and nonsites, FEATURE uses the Rank-sum test (discussed in section 2.1) to compare the list of values for that property in the sites with the list of values from nonsites. The significance level of the observed difference is given by the p-level returned by the Rank-sum test. If the p-level is lower than a specified cutoff value (for example, 0.01), that property at that shell is reported to be significantly different for the sites compared to nonsites. It may be surprising that looking at the environment surrounding a site with concentric radial shells is sufficient to pick out many key features. Indeed, the analysis of sites can sometimes benefit from a more localized division of space, so that the spatial relationships are preserved and not radially averaged. FEATURE can also analyze properties in oriented blocks [ 101. These analyses often require more data since the total volume being compared are smaller and so need better sampling to reliably detect differences. FEATURE has been used to characterize a number of microenvironments, including calcium binding sites, disulfide bonding cysteines, serine protease active sites, ATP binding sites, and sodium binding sites. It has found many of the previously observed features of the sites as well as some novel findings. FEATURE summarizes the findings in a graphic plot, an example of which - the description of the microenvironment of ATP binding sites - is shown in Fig. 1. Some of the significant findings are summarized below: (1) There is a deficit of atoms from 0 to 4A, reflecting the empty binding pocket. (2) There are significantly more sulfur atoms at radius 10 A than observed in the random control nonsites. (3) There is an abundance of positive charges at radii 3 to lOA, and an abundance of negative charges at radii 10 to 12A. The ATP molecules are surrounded by positive charges, and the positive charges are, in turn, surrounded by an outer shell of negative charges. (4) There are significantly more ASP, LYS, GLY, and MET residues in ATP binding sites observed in controls. (5) There is an abundance of charged residues at radii around 7 to 12A. (6) There is an abundance of both helices and 6-strands. It is well known that the charged ASP and LYS residues are important in binding ATPs.
215
DH-NWE- IS-FW DH-NRIE- I S-c DH-NWlE- IS-N DM-PIRII- IS-D C#i-NRHE-IS-S
. .
. . . .
. . .
. . . . . . .
. . .
t!T-RCCESSIBILIN U E m Is* UE-cF1IE- IS-RRG SIDUE-M-ISSIDUE-M-ISSIDUE-M-IS-OLN SIWE-NFIE-IS-GLU SIDUE-NFCIE-IS-OLY IDUE-IWlE-I S- ILE IDE-ME-IS-LEU IDUE~-IS-LYS IDUE-PRE- IS m SIWE-WIZ-IS-P!4€ SINE-PRE-ISSINE-PRE-IS-SER SIWE-PRE-IS-THR SIWE-PRE-IS-TRP SIWE-WIZ-IS-TWI SIDUE-PFWE-IS* SIDUE-PFIE-IS-HOH SIDUE-ME-IS-DTHER SIDUE-URSSI-IS-WDRPHOBIC SIWE-CLRSSI-IS-MARGED 1 DUE-CLRSS 1- IS-UYKNOUN IDUE-CUSS2- IS-raP(POLRR IDE-CLRSS2- I S-WLFR IDUE-CLASS2-1S-XIDIC I DUE-CLRSSZ- IS-BRS IC IDUE-URSS2-IS-W%NOUN DNDRRY-STRUCNRE1-IS-3-HELIX DNDRRY-STRUCTURE 1- IS-4-HEL IX DNDflW-STRUCTUREl-IS-5-HELIX DNDAW-STRUCTURE 1- IS-BR IDOE DNDRRV-STRUCTUREl-IS-STRRND ONDRW-STRUCTWEl-IS-TURN
CDNOARVSTRUCNREl- IS-CO IL CONDRRVSTRUCTCX 1- IS-HE1 CONDRW-STRUCTWU-IS-HELIX CDNDRRY-STWTUREZ-IS-BETR CDNDARY-STRUCTURU-IS-COIL
. . . . . . . . . . . .
. . . .. . . . . . . . . . . . . . . .
:::::m
. . . . . . . . . . . .
. . . .. . . . . . . . . . . . . . . . . . . .
I:: . . . . . . . .
. . . . . . . . . . . .
. . . . . . . . . . . . . . . .
1[II
. . . . . . . . . .
. . . . . . . . . . . .
. . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . .
. . . .
fr I,....:
. . . .' .. .' .. .. "
. . ii. . . . . . .
.. . .. ... .. ... .. .n . m . .ment of an ATP binding site. Along the
. . . . . . . .
. . . . . . . . .:
:m
. . . . . . . . . . . .
. . . .
. . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. .
. . .
.. . .. .:. :. .:. :. ..:.. m..
Fig. 1. Description of the microenviron-
vertical axis are properties, and along the horizontal axis are shell volumes (for example, shell 1 is the shell from 1 to 2 around the origin). Dark-gray cells mark propertyvolume pairs for which the ATP binding sites have significantly high values; the light-gray are cells for which the ATP binding sites have significantly low values. Empty cells show no significant difference.
a
Thus, it is not surprising that they are over-represented in the ATP binding sites (as shown in Fig. 1). Their presence also causes the abundance of charges in the environment. Glycine is a conserved residue, because any larger residue causes steric hindrance to the binding of the ribose ring [ 111. Our observation that helices and fi-strands are abundant in ATP binding sites is consistent with the secondary structural motif found in nucleotide-
216
binding proteins - the Rossmann Fold [ 111. We currently have no feature corresponding to supersecondary structural motifs. Adding such a feature to the system would allow it to recognize the nucleotide binding fold directly, instead of inferring it as a collection of low level features - helices and p-strands.
4.2. Recognizing protein sites using environmental statistics A second use of FEATURE is to recognize microenvironments using the description model built by comparing sites with nonsites. Given a new, unannotated protein structure, one of the first things we want to know is whether it has any functional sites. Although we can perform biochemical experiments to elucidate functional sites, they are often expensive. Therefore, computational methods offer a useful screening strategy. Given a protein structure, we would like to have an automated method for recognizing a h c t i o n a l site of interest. If we use known examples of sites and nonsites, the problem becomes one of supervised classification. In the last section, we established that FEATURE statistically characterizes the properties of sites by comparing them with nonsites. In this section, we will outline a general purpose, automated recognition system that uses the distinguishing features of a site and a description of a structure to decide if the structure contains the site [ 121. In order to explain its decision, the system generates a prose report of those features which contribute most to its classification decision. As described, FEATURE calculates the spatial distributions of properties in the example sites and nonsites. The values of these properties in the region of interest can also be computed. Conceivably, if the region is a site, then the property values for the region should be similar to the property values of the example sites. How can we assess similarity, since the ranges of property values for the sites and nonsites may overlap? How can we decide whether the observed values from the new structure are more compatible with the statistical distribution of site values or the nonsite values? We use a Bayesian scoring function to score the likelihood that a property value is drawn from the site distribution. For one property within one spatial volume, the posterior probability that the property value, u, of the new structure, comes from the statistical distribution of site values can be calculated using Bayes’ Rule: P(site1u) =
P(u1site)P(site) P(uIsite)P(site) + P(ulnonsite)P(nonsite) ’
where P(site) and P(nonsite) are the prior probabilities of site and nonsites, and need to be provided to the system. They are based on the overall probabilities of seeing a site of interest, and they must be estimated. P(u1site) is the probability of seeing the observed value in a site, while P(u1nonsite) is the probability of seeing the observed value in a nonsite. These probabilities can be evaluated by dividing the total range of site and nonsite values into bins, and comparing the counts in each bin for sites and nonsites. Given the observed value, u, in a new structure, we identify the bin in which u falls and assign P(u1site) to be the ratio of counts in that bin for sites over the total number of counts.
217
’rotein: lARC :enter of site of interest: (-2.284 -4.002 10.663) ikelihood score: -86.063 t is most probably not an ATP-binding site. itrongest evidence SUPPORTING that it is a site: VDW-VOLUME at shell 0, 1, 2 is low. ATOM-NAME-IS-ANY at shell 0, 1 is low. ATOM-NAME-IS-C at shell 1, 2 is low. RESIDUE-CLASS2-IS-ACIDIC at shell 10 is high. itrongest evidence AGAINST that it is a site: RESIDUE-CLASS1-IS-CHARGED at shell 7, 8, 9 is low. RESIDUE-NAME-IS-ASP at shell 10 is low. POS-CHARGE at shell 5. 6 is low. NEG-CHARGE at shell 10, 11 is low. CHARGE at shell 5, 6, 7, 8 is low. VDW-VOLUME at shell 8, 10, 11 is low. VDW-VOLUME at shell 4 is high. MOBILITY at shell 11 is low. MOBILITY at shell 3 is high. Fig. 2. An example of prose explanation: why the binding pocket in serine protease lARC is identified as not an ATP binding site. The key properties are presented in order of strength of evidence, grouped by whether they are supporting or refuting evidence.
The score for the overall likelihood of the query region being a site is the sum of the logarithm of the probability update ratios:
A detailed explanation of the evidence contributingto the classification of a site is a useful adjunct to the actual classification result. FEATURE generates a report of the strongest individual pieces of evidence contributing to the total score in eq. (2). This report can be used to understand how different features contribute to the classification and to provide information about which properties would need to be modified in order to change the degree of fit between the new structure and the sitelnonsite models. Figure 2 shows an example of a report explaining why the binding pocket in a serine protease structure (PDB identification number 1ARC) is not an ATP binding site. lARC binds N-P-tosyl-L-Lys chloromethyl ketone hydrochloride (TLCK), a molecule similar to ATP in both shape and size [13]. Thus, the binding pocket in lARC has similar geometric properties to ATP binding sites, shown in “strongest evidence SUPPORTING that it is a site” in Fig. 2. However, the “strongest evidence AGAINST that it is a site” suggest that the binding pocket does not possess other important biochemical and biophysical properties necessary for binding ATP Since there is more evidence suggesting that the pocket does not have the right microenvironment to bind ATE the pocket should be classified as not being an ATP binding site.
218
Fig. 3. Scanning the structure of phosphotransferase (PDB id ICSN) for ATP binding sites. The top 20 highest scoring points are marked with dots. Only the backbone of the protein structure is shown. An ATP molecule is depicted at the actual binding sites, with the actual orientation, for comparison with the computational results.
FEATURE has been successfully applied to recognizing several ionic binding sites and ligand active sites [12]. The accuracy of this method has been evaluated using sensitivity and specificity, computed both on independent test sets and in cross-validation. The sensitivity and specificity typically range from 86% to 100%. Extensive sensitivity analyses have been performed on parameters of the program such as the prior probability of site, the number of bins, the size of the environment, and significance cutoff. These studies have shown that the method is robust to a wide range of values for each parameter. In particular, the method performs consistently over a wide range of prior probabilities the most difficult parameter to assess accurately. FEATURE has also been successfully used to scan whole protein structures for the occurrence and location of sites, such as calcium binding sites and ATP binding sites. First, a search grid is defined on top of a new structure. Then, the recognition function is applied at each grid point to compute the likelihood that the region around that point
219
is a site. Finally, high-scoring grid points are labeled and can be visualized graphically. Figure 3 shows a typical result of scanning the structure of phosphotransferase (PDB identification number 1CSN) for ATP binding sites. The actual ATP binding site in phosphotransferase, experimentally determined by X-ray crystallography, is clustered and well localized by high-scoring points. There are a few high-scoring points that are far away from the actual binding sites, and they should be considered false positives according to PDB documentation which we use as gold standard. 4.3. Comparing the environments of amino acids and constructing a substitution matrix
The techniques developed in the previous sections provide a general framework for describing and recognizing the patterns in the biochemical environments of a set of protein sites. An interesting application of this technology is to study the environments found around each of the amino acids. We have created statistical descriptions of the biochemical features found around each of the amino acids [7]. These descriptions provide insights into the roles that the 20 amino acids play within protein structure. Comparing the similarities and differences in the environments between pairs of amino acids provides information on their likelihood to mutate into one another. For our analysis, -100 instances of each amino acid were randomly selected from a set of 20 nonhomologous proteins [14]. The environment around each of the amino acid instances was described by FEATURE as a spatial distribution of physical and chemical properties. FEATURE compared the environments of each amino acid (considered as sites) with random control environments (considered as nonsites), then compared each amino acid environment (site) with each of the other amino acid environments (nonsites). All significantly different features between each amino acid environment and the random control environments, and different features between any two amino acid environments, were reported. Figure 4 shows two examples of comparisons of amino acid environments. Graphical descriptions of the all comparison results are available on the WWW at
During the process of evolution of a protein sequence, one amino acid may be substituted by another at some location. The tolerance of a microenvironment for one amino acid or another is not only a function of the properties of the lost amino acid, but also a function of the properties of the environment surrounding the lost amino acid. We used the statistical descriptions of amino acid environments to construct a substitution matrix (the WAC matrix) to illustrate the validity and utility of comparing amino acid environments. When one compares two amino acid environments using FEATURE, the Rank-sum test yields a z-value that describes the normalized difference between the two amino acid environments for a property at a particular volume. The overall difference between the two amino acid environments can be estimated by taking the sum of the squared z-values for each property and volume. This yields a single number that represents comparatively the difference between the features found in the environments of the amino acid pair. If
220 Lvr H &a On-rwY- IS-ANY OIHYWIC- I P-c DN-FWE- IS-N Is-0 miWE- Is-s
m*-
lM-lW€- IS-OTHER
L H H W
:::II:::11 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
::&A . .
. . .
. . . . . . . . . .
. . . . . .
. . . . . . ::II . . . . . . . . . . . . . . . .
>:,!. ii!i!i.
. . . . . . . . . .
I::::
_ .._m::: . . . . . . . . . . -. :. :. :. . . . . . . . . . .
NT-ACCESSIBILITV IW-NfVIE- Is-RLII IWE-IYVE-IS-fiRG IWE-rwE-IS-RSN IWE-PYYE-IS-ASP IWE-rwE-IS-cvs IWE-WE-IS-OLN IWE-NRE- Is-GLU IWE-NWE-IS-OLY IWE-WE-IS-HIS IWE-WWE- IS- ILE I WE-WWE- IS-LEU IDUE-WIME-IS-LVS IWE-wE-IS-El IWE-WHE-IS-PHE IWE-WIE-IS-PRO IDUE-WME-IS-SEA IW-WE-IS-THR SIWE-NRE-IS-TRP SIWE-NAE-IS-TVR SIWE-NAME-IS-VAL SIWE-NAIE-IS-HOH SIDE-WKL-I S-OTHER S IWE-CLRSS 1- IS-HMROPHOB IC SIWEURSSI-IS-CHRRO SIDUE-CLRSSl-IS-POLAR S I DUE-CLRSS 1- IS-unOtouC SIWE-CLRSS2-IS-NOPPDLRR SIWEURSS2-IS-POCAR SIWE-CLRSS2-IS-RCIDIC SIWE-CLRSS2-IS-BRSIC
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . .
:11
. . . . . . . . . .
. . . . . . . . . . ,
,
. .
,
,
,
.
,
.
,
. . . . . . . . .
.
..:::II11:
. . . . . . . . . . . , , . ,
. . . . . II::: . . . . . . . . I I. . . ." . '
-
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
SECDNMRV-STRUCTWIEl-IS-BRID~ SECMDARV-STRUCTWIEI-IS-STRIWD
SECDNMRV-STRUCTUREl-IS-TUW
SEMMARV-STRUCTLIREl-IS-BW
SECDNORRV-STRUCTUREl-IS-COIL
SECOE(DRRY-STRUCTURE1-lS-HET SEMYYWIRV-STRUCTURE 1- IS-UNKNOUN SECOmRRY-STRVCTURE2-lS-HELlX SECWUMFlV-STRUCTURE2-IS-BETA
SE~V-STRuCTwIE2-lS-COlL
SEtOWDARY-STRUCTURE2-IS-HET SECOE(DARY-STRUCTURE2-IS-Ur((NOU(
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Fig. 4. Example comparison results of environments around amino acids. The left-hand column lists all the properties. All properties are analyzed at 1 A volumes up to 10A from the Cp carbon. The volumes for which the properties were significantly more prevalent for the first sites are marked in light gray, and dark gray for the second. "LYS versus A R G shows much fewer significantly different properties than "LYS versus VAL? indicating that LYS is much more similar to ARG than it is to VAL.
22 1
s 0 4 T 0 1 4 P - 1 0 0 4 A 0 0 0 0 4 0 0 0 0-1 4 G N 0 1 1 0 0 0 4 D -2 -1 -1 -1 -2 -2 0 4 E -2 -2 -1 -1 -3 -2 0 0 4 Q 0 0 1 0 0 0 1 - 1 0 4 H 0 - 1 - 1 0 0 - 1 0 - 1 -.1 0 R -1 -1 0 0 -1 -1 0 -1 -1 0 K -2 0 0 0 - 1 - 1 0 - 1 - 1 0 M 1 0 0 1 1 0 0-1-1 1 I 0 -1 0 0 0 -2 -1 -4 -4 0 L 0 -1 0 0 0 -1 -1 - 3 - 3 0 v 0 -1 0 0 1 -2 -1 -3 -4 -1 F 0-2-2 0 0-2-1-4-4 0 Y 0-1-1-1 0-2 0-3-4 1 w 0-1-1 0 0-2-1-3-2 0 C S T P A G N D E Q
4 0
4
2 4 0 0 0 -1 -2 -2 -1 -1 -2 -1 -2 -3 0-2-2 0 - 1 - 2 0 0-2 H R K 0
4
2 2 1 2 0 2 M
4 1 2 0
0 0 I
4 1 4 1 0 4 0 0 1 1 0 2 L V F
4 1 Y
4 W
Fig. 5 . The WAC substitution matrix computed by comparing the amino acid environments. The other half can be generated by symmetry.
the two environments are similar, then there are not many distinguishing features with high z-values. If they are different, then there are many distinguishing features. The summed difference of the amino acid pairs were used to create a 20x20 substitution matrix, shown in Fig. 5. To evaluate the performance of WAC, the sensitivities of three substitution matrices, WAC, BLOSUM62, and PAM250 [ 151 were compared in detecting related protein sequences in the same family (as defined in PROSITE [16]). Although BLOSUM62 was the most sensitive matrix overall, WAC was more sensitive for some families and showed overall performance similar to PAM250.
4.4. Threading a protein sequence onto a fold The final use of FEATURE’S statistical description that we will discuss is in the area of protein threading or fold recognition. Given a query sequence, we want to determine if it can adopt any of the known 3D folds in the PDB. In fold recognition, the query sequence is aligned to each structure in a library of folds. Using a fitness scoring function, an alignment algorithm searches for the sequence-fold alignment with the highest score. Then, the scores for the alignment of the sequence to each fold in the library are ranked to find the most probable fold for the sequence. For a step-by-step review of threading methods, see ref. [17] and chapter 12 of the present volume. Here, we will describe the set of choices we have made for our fold library, alignment methodology, and fitness function.
222
4.4.I. Library of folds The Library of Protein Family Cores (LPFC) is a resource with information about the folds adopted by protein families. It can be found at
http://www.smi.stanford.edu/project/helix/LPFC LPFC is based on the FSSP database, which clusters protein structures based on structural similarity. For each fold family, LPFC provides the multiple structural alignment of the protein structures within the family and a consensus backbone structure that includes all common aligned backbone positions that are found in each family member - called the core positions. Positions that are occupied by some - but not all - fold family members are not included in the set of core positions. If there is a sequence of consecutive core positions in the multiple alignment that all belong to the core, it is called a “core segment”. The core backbone positions provide the scaffold upon which a query sequence is mounted in order to assess its compatibility with the fold family. When all the members of a fold family have been structurally aligned, then each core position (centered at the a-carbon of the core amino acid) can be described statistically using FEATURE. The core positions are the sites, and randomly selected a-carbon environments are the nonsites. Thus, the fold family can be considered a string of distinctive microenvironments which can be used to recognize structure. Query sequences can be mounted (or threaded) onto the structural backbone of the core, and then evaluated to determine if their side chain interactions are similar to those observed in the natural structures. The description of the a-carbon environments can be used in two ways to identify matches between query sequences and family fold core models. In the first method, based on environmental profiles, the statistical description of each core position is compared to the statistical description of each amino acid (as described in previous sections) [4]. A score is determined to summarize the match between the two descriptions. For each core position, the environment of each structure in the fold family is extracted as sites. In addition, the environment of -100 instances of each amino acid is also collected. Then, the difference in the environments of each amino acid to the core position is scored in a manner similar to that used to construct the substitution matrix. The squared z-scores representing the statistical differences for each property of the two environments are summed to yield a score representing the total difference in the two environments. Repeating this for each core position yields a matrix that provides the score comparing each amino acid with each backbone environment. A new sequence can be aligned to the core backbone optimally using dynamic programming sequence alignment algorithms and the matrix provided by this analysis as a score matrix. If the score of a query sequence resembles the scores of sequences which are known to adopt the fold, then it is likely that the query sequence also adopts that fold. This algorithm is relatively simple to implement, but it loses information because it compares core backbone environments with average side chain environments, and not with the detailed side chain environments created when the query sequence is mounted onto the core backbone. The second fold recognition method is based on detailed scoring of a threaded structure with environmental descriptions. The statistical description of each core backbone
223
position is used to score a query sequence that has been aligned with the core backbone. The side chains of the query sequence are instantiated upon the backbone to produce a detailed atomic model. The compatibility of the atomic model and the statistical description at each core backbone position can be scored using our site recognition algorithm. One can view a protein structure as a string of microenvironments surrounding the a-carbons within the sequence. Given a set of aligned structures from a fold family, we can generate a statistical description of the environments surrounding each a-carbon. If we take a template backbone and mount the side chains of a new sequence upon the template, then we can compute the score of each a-carbon environment (created by the side chains as they hang off the template backbone), and sum these to get an overall measure of the compatibility of this sequence to the template backbone. This is the essence of how we use FEATURE for sensitive threading. After obtaining a structural alignment file from the LPFC database, we select structures with less than 25% sequence identity. Using the average backbone fold of the proteins in the multiple alignment, the structures within the fold family are superimposed, thus creating a set of environments around each a-carbon position. To extract the statistical environmental descriptions, FEATURE! is applied to the region surrounding each of the core a-carbons within the training proteins. For the nonsite controls, a-carbons are randomly selected from twenty nonhomologous proteins. FEATURE collects information about the key biochemical and biophysical features of the microenvironments around the aligned a-carbons, as compared with the randomly chosen a-carbons from these proteins as previously described [3,18]. The analysis is centered at the a-carbon of the sites and nonsites, and a threshold of p = 0.01 is used to determine if the differences in the environments for given properties at specific radii were statistically significant.
4.4.2. Alignments Since the core positions typically occur within a set of ungapped core segments, the task of generating an alignment is equivalent to generating loop lengths between the core segments. For each query sequence (that sequence not used in training the model), we generate possible alignments with the core. We generate random sample alignments using information about the average loop lengths gathered from the multiple alignment file. Using the multiple sequence alignment from the LPFC, we can compute the mean length of each known loop, its variance, and its covariance with other loop lengths. Assuming a multivariate, correlated normal distribution of loop lengths, we can generate sample gap lengths that follow a similar distribution as the known protein fold family members. Alternatively, we can use standard sequence alignments and then systematically vary them by shifting segments to the left and right to generate a good sample of plausible alignments of the query sequence with the known family members. With these alignments, we can search for high-scoring alignments, and determine whether any alignment scores high enough to reliably classify the query sequence as belonging to the fold family. 4.4.3. Generating and scoring actual and sample structures Given a test alignment, we can not score the environment around the a-carbon using FEATURE until the side chains of the query sequence are mounted upon the template
224
backbone. The SCWRL (Side Chain placement With a Rotamer Library) program uses a backbone-dependent rotamer library to place side chains onto a protein backbone [191. We therefore supply SCWRL with the core backbone positions and the alignment of the query sequence with the backbone. SCWRL positions the side chains onto the template backbone and then performs some optimizations to minimize steric clashes. The SCWF2L program for placing side chains is able, in our tests, to reproduce crystal structure side chains to about 1.5-2 rmsd. After attaching the side chains onto the core backbone, we use the scoring method to determine the compatibility of the newly created environments with those observed in real family members. It is important to note that the quality of the threading results is dependent on the quality of the structural alignment in the FSSP resource. Any inaccuracies in this alignment will affect the results. In fact, improvements in the alignment would most likely lead to a more pronounced difference in the score between the correct alignment and incorrect alignments, since the environmental descriptions would be more precisely defined.
5. Summary This chapter has described an approach for statistical analysis of the three-dimensional features within protein structures and has shown that the method can be used to: (1) characterize protein sites, (2) recognize protein sites in new structures, (3) compare amino acid environments, (4) generate an amino acid substitution matrix, (5) thread query sequence with environment profiles, and (6) thread query sequence with detailed environments. Until recently, much effort has focused on statistical analysis of sequence information. Although powerful methods have been developed, they are always limited by the reality that sequence is often a proxy for structure and function. Statistical analysis of aligned 3D structures has now become feasible as the number of experimentally determined structures increases. As the database of structures continues to grow, our ability to describe the conserved structural and functional features in a sensitive and specific manner will improve. The availability of large data sets will increase the power of the methods described in this chapter, as they are applied to problems of protein structure analysis and engineering. We also expect that analysis of structure will create new insights about function.
Acknowledgments This work has been supported in part under grants from the Culpeper Foundation, the National Institutes of Health (NIH-LM05652, LM-06244, LM-05305), the National Science Foundation (NSF-BIR9600637), and the IBM corporation. Steve Bagley, Allison Waugh and Michelle Whirl have participated in the construction andor testing of the methods described in this chapter. Kevin Lauderdale and Michelle Whirl have read and commented on the text. We also thank Dr. Simon Kasif and Dr. Steven Salzberg for their editorial comments.
225
References [l] Creighton, T.E. (1993) Proteins: Structures and Molecular Properties, W.H. Freeman and Company, New York. [2] Bernstein, F.C., Koetzle, T.F., Williams, G.J.B., Meyer, E.F.J., Brice, M.C., Rodgers, J.R., Kennard, O., Shimanouchi, T. and Tasumi, M. (1977) J. Mol. Biol. 112, 535-542. [3] Bagley, S.C. and Altman, R.B. (1995) Protein Sci. 4, 622435. [4] Bowie, J.U., Luthy, R. and Eisenberg, D.(1991) Science 253, 164-170. [5] Bryant, S.H. and Lawrence, C.E. (1993) Proteins: Struct. Funct. Genet. 16, 92-112. [6] Sippl, M.J. and Weitckus, S. (1992) Proteins: Struct. Funct. and Genet. 13, 258-271. [7] Wei, L., Altman, R.B. and Chang, J.T. (1997) In: R.B. Altman, A.K. Dunker, L. Hunter and T.E. Klein (Eds.), Proc. Pacific Symp. on Biocomputing. World Scientific, Singapore, pp. 465476. [8] Ott, L.R. (1993) An Introduction to Statistical Methods and Data Analysis, Wadsworth Publishing Company, Belmont, CA 94002. [9] Glantz, S.A. (1987) Primer of Biostatistics, McGraw-Hill, New York. [lo] Bagley, S.C., Wei, L., Cheng, C. and Altman, R.B. (1995) In: C. Rawlings, D. Clark, R. Altman, L. Hunter, T. Lengauer and T. Wodak (Eds.), 3rd Int. Conf. on Intelligent Systems for Molecular Biology. AAAI Press, Cambridge, pp. 12-20. [ l l ] Rossmann, M.G., Moras, D. and Olsen, K.W. (1974) Nature 250, 194-199. [12] Wei, L. and Altman, R.B. (1998) In: R.B. Altman, A.K. Dunker, L. Hunter and T.E. Klein (Eds.), Proc. Pacific Symp. on Biocomputing. World Scientific, Singapore, pp. 497-508. [I31 Tsunasawa, S.,Masaki, T., Hirose, M., Soejima, M. and Sakiyama, F. (1989) J. Biol. Chem. 264, 38323839. [14] Holm, L. and Sander, C. (1994) Nucleic Acids Res. 22, 360Cb3609. [15] Dayhoff, M.O., Schwartz, R.M. and Orcutt, B.C. (1978) Atlas Protein Sequence Struct. S(Supp1. 3), 345-352. [16] Bairoch, A. (1991) Nucleic Acids Res. I9(Suppl.), 2241-2245. [17] Fischer, D.,Rice, D., Bowie, J.U. and Eisenberg, D. (1996) Faseb J. 10, 126-136. [18] Altman, R.B., Whirl, M., Waugh, A,, Wei, L. and Chang, J.T. (1997) Stanford Medical Informatics Technical Report SMI-974682. [19] Bower, M.J., Cohen, F.E. and Dunbrack, R.L.J. (1997) J. Mol. Biol. 267, 1268-1282.