BioSystems 77 (2004) 195–212
Cluster and information entropy patterns in immunoglobulin complementarity determining regions
Stephanie Culler a, Tai R. Hsiao a, Mark Glassy a,b,c, Pao C. Chau a,∗
a Chemical Engineering Program, University of California, San Diego, La Jolla, CA 92093, USA
b Shantha West Inc., San Diego, CA 92121, USA
c The Rajko Medenica Research Foundation, 10246 Parkdale Avenue, San Diego, CA 92126, USA
Received 29 January 2004; received in revised form 27 May 2004; accepted 27 May 2004
Abstract Previous studies of antibody binding domains have established many crucial features that include important structural positions, canonical formations, and the geometric correlations with the binding site nature and topography. In this work, position-specific frequency and hierarchical clustering analysis are used to explore the statistical pattern of the residues in the complementarity determining regions of human antibodies. In addition, Shannon’s information entropy is computed for the entire heavy and light chains and compared with germline patterns to seek variability due to antibody clonal selection. Results are compared with reported analyses based on structural data and ligand-protein contact point computations based on Protein Data Bank records. Observations derived from the present sequence analysis are consistent with previous structural based methods. In the absence of structural data, methods used in this work can be effective and efficient computational tools used for identifying residues that are important for antigen targeting and predicting the probable amino acid distribution expected at these positions. The results in turn can be applied to help design or plan mutagenesis experiments to improve the binding properties of antibodies. © 2004 Elsevier Ireland Ltd. All rights reserved. Keywords: Immunoglobulins; Complementarity determining regions; Hierarchical cluster analysis; Information entropy; Molecular recognition
∗ Corresponding author. Tel.: +1 858 534 6935. E-mail address: [email protected] (P.C. Chau).
0303-2647/$ – see front matter © 2004 Elsevier Ireland Ltd. All rights reserved. doi:10.1016/j.biosystems.2004.05.033
1. Introduction
The immune system, through sequence and configurational diversity, has the capability of generating antibodies that target an almost infinite array of chemical structures. The variability of an antibody is enabled largely by six hypervariable complementarity determining regions (CDRs) that form the antigen-combining site. In addition to the chemical characteristics of the surface residues, the specificity and affinity are also determined by the structure of the CDR loops and, to a lesser extent, the framework regions. The variable regions together provide a surface profile structurally and chemically complementary to an antigen binding site. It is clear that binding properties of matured antibodies are driven by the nature of antigen recognition as well as genetic and structural factors (Almagro et al., 1996).
Despite the variability, many aspects of the antibody molecule are quite invariant in both sequence and structure, including the β-sheet framework that supports the CDRs. The Fv antibody fragment has roughly 230 residues, of which about 70 constitute the six CDRs. Even here, many CDRs adopt a limited set of canonical backbone conformations that are determined by the length of the loops and the presence
of certain key structure-determining residues (Chothia and Lesk, 1987). One of the main goals in the study of engineering antibodies is to determine how an antibody could be modified to improve its targeting against an antigen of interest. The answer is partially embedded within the sequence data and the pattern of selection from germlines. Sequence analysis, which is computationally less expensive than detailed modeling of antibody-antigen interactions, could provide insight into the specificity and affinity of immune recognition. The result would help to predict binding and design improvement strategies prior to the availability of structural information. In a broader perspective, these insights have important implications in the study of molecular recognition by proteins, and in understanding the evolution of the binding repertoire in many medical problems, including autoimmune diseases and cancers. There has been a longstanding interest in studying antibody binding and understanding how antibody specificities are related to sequences. Beginning with the identification of key structural positions by Kabat et al. (1977), many additional studies have attempted to establish relationships among sequences and the three dimensional structures of antibody binding sites, the properties of stability, specificity and affinity (Al-Lazikani et al., 1997; Collis et al., 2003; Decanniere et al., 2000; MacCallum et al., 1996; Ramirez-Benitez and Almagro, 2001; Wu et al., 1993). From these studies, it is now known that amino acid usage is not uniform, the length distribution of the CDRs is correlated to the nature of the antigen, and that there are only a select few contact points. Furthermore, only a relatively small number of residue substitutions during somatic hypermutation are needed to arrive at the eventual matured antibody (Boder et al., 2000; Kirkham et al., 1999; Steipe et al., 1994; Wedemayer et al., 1997). Knappik et al. 
(2000) showed that by focusing mutagenesis at positions that are most likely in contact with the antigen, one could theoretically reduce the antibody library search space from 2 × 10^9 down to 1.3 × 10^6. From a statistical perspective, Lara-Ochoa et al. (1994) demonstrated that amino acid usage within hypervariable regions CDR-1 and CDR-2 follows either a power-law or an exponential position-specific rank-order distribution that can be ascribed to either structural or recognition properties. Based on frequency computations, Knappik et al. (2000) designed seven master VH and VL genes to serve as a synthetic human combinatorial library. Their work again suggests that the pragmatic search space is much smaller than the theoretical possibilities.
Most of the seminal CDR analyses utilized structural data, which are not always available. It would be valuable to derive similar insight from sequence information alone, using computational algorithms that can accomplish the task efficiently. Hence the objective of this work is to investigate the amino acid usage in the CDRs, and to explore statistical patterns that one may subsequently use to narrow the search space for antibody designs. Specifically, the present work makes use of position-specific frequencies, hierarchical clustering, and information entropy, all quantities that can be computed quickly and easily. In an earlier analysis, Shenkin et al. (1991) established that information entropy is superior to the Kabat-Wu index in the analysis of immunoglobulin sequence variability. Interpretations of the statistical patterns are related to published structural information. The overall results complement important structural analysis, and together the information can be used to improve antibody design strategies and functional predictions.
2. Methods
Sequence data were downloaded as a flat file from the August 2003 release of the International ImMunoGeneTics Information System (IMGT, http://imgt.cines.fr:8104/) sequence database. Human immunoglobulin sequences were extracted and the Kabat definitions of the CDRs were used: CDR-L1, L24-L34; CDR-L2, L50-L56; CDR-L3, L89-L97; CDR-H1, H30-H35; CDR-H2, H50-H65; CDR-H3, H95-H102 (Martin, 2001). A total of 692 light and heavy chain sequences were extracted and sanitized prior to analysis. Duplicate records, especially those with completely identical CDR regions, and uncertain records with "X" or "?" characters were rejected. Incomplete sequence records that could not be parsed and sequences that deviated significantly from the expected CDR patterns were also omitted. The final cleaned dataset contained 157 heavy and 144 light
chains. These records with complete annotation were imported into relational database tables designed to facilitate subsequent sequence query and analysis. MySQL was used as the local database server running on Linux. All subsequent analysis is based on data retrieved from this local database. All data manipulation was performed using scripts written in Perl, with PHP also used for Web-based activities and functionalities. Statistical analysis (frequency computation, hierarchical clustering) was performed using R (http://www.r-project.org). The hierarchical clustering algorithm used average linkage and Euclidean distance. At each stage of the clustering, the dissimilarities between clusters were recomputed by the Lance–Williams dissimilarity update formula.
Shannon's information entropy was computed with the use of pseudo-counts in the frequency of amino acid a appearing at position j (Henikoff and Henikoff, 1996):
$$ f(a,j) = \frac{n(a,j) + b(a,j)}{N(j) + B(j)}, \qquad (1) $$
where n(a, j) is the number of occurrences of amino acid a at position j in a CDR, N(j) is the total number of amino acids at position j, b(a, j) is the pseudo-count of amino acid a at position j, and
$$ B(j) = \sum_{k=1}^{20} b(k,j) $$
is the total pseudo-count at j. In this work, the Gibbs sampler choice, B(j) = √N, was used as the pseudo-count. Finally, the information entropy at each position j is computed as
$$ S(j) = -\sum_{k=1}^{20} f(k,j) \ln f(k,j), \qquad (2) $$
where the natural logarithm has replaced the more formal base 2 definition.
Germline sequences were downloaded from the VBASE sequence library (Centre for Protein Engineering, http://www.mrc-cpe.cam.ac.uk); 58 light (both κ and λ) and 48 heavy germline sequences were used. The heavy chain clone sample set was made of entries with accession numbers E35211, E35212, E35215, E35216-E35218 and E35220, which are specific against parathyroid hormone. The light chain clone sample set included entries E12553, E12555, E12557, E12559, E12913 and E12918, which are specific against hepatitis B virus.
Contact point analysis was performed on 46 heavy and 43 light sanitized PDB entries of antibody-antigen complexes from various species using the Ligand-Protein Contacts (LPC) program provided by Vladimir Sobolev (Sobolev et al., 1999). Based on the total CDR contact points within the dataset, the percent occurrence of contacts made by each residue was expressed as a contact index.
3. Results
Our analysis is divided into two main categories: clustering and entropy computations. Evaluation of position-specific frequency is part of the clustering calculations. Contact point analysis and structural implications of the statistical patterns are presented in Section 4.
3.1. Clustering analysis
The hierarchical clustering analysis of the CDRs was performed by positions and by amino acids. In making interpretations, one must keep in mind that observations merely reflect statistical trends of the sample population. For each CDR, three plots are used to illustrate statistical patterns: (1) a position-specific frequency chart, (2) a dendrogram by positions, and (3) a dendrogram by amino acids. The Kabat sequence numbering scheme and CDR definitions are used throughout the present work. The amino acid usage is normalized with respect to each position, as in the computation of position-specific scoring matrices. Thus the frequencies in Figs. 1 and 2 sum to one over all amino acids at each position; they represent position-specific usage and not general utilization within an entire CDR.¹
¹ Together with other supporting material, general usage frequencies are provided as Web Supplement. Available at http://pcclab.ucsd.edu/cdr/.
Fig. 1. Histograms of the position-specific frequency of amino acid usage within each light chain CDR region. On the axis, the amino acids are arranged by types. The amino acid usage is normalized with respect to each position. The frequencies of all amino acids in each position sum to one, and do not represent overall usage within an entire CDR.
Fig. 2. Histograms of the position-specific frequency of amino acid usage within each heavy chain CDR region. See Fig. 1 caption comments.
Because of the use of position-specific frequencies, one must be careful in interpreting the so-called insertion sequence positions in the numbering scheme. The lightly populated insertion numbers include LV27B-F in CDR-L1, LV95A-F in CDR-L3, HV35A-B in CDR-H1, HV52B-HV52C
in CDR-H2, and HV100B-HV100R in CDR-H3. Relatively few immunoglobulin records have longer than nominal CDRs with residues in these insertion positions.
3.1.1. CDR-L1
There are a handful of notable observations from the examination of the position-specific frequency chart (Fig. 1a). Among hydrophobic residues, A, I, L, V, and G are more often utilized, with A and L appearing at both ends of the CDR, while I, V, and G are more broadly distributed, especially in the middle region. Interestingly, A is present mostly at LV25 and LV34, while L occurs almost exclusively at LV33. Position LV33 can be categorized as nonpolar due to extensive L usage. Other hydrophobic amino acids, including P, W, and F, barely occur in CDR-L1, while M and C are actually absent. It is no surprise that C is seldom used, as in other CDR regions. In total, hydrophobic residues account for approximately a third of the amino acids used in CDR-L1.
Among hydrophilic residues, S is most common, which is also the case in CDR-L2, CDR-H1, and CDR-H2. In CDR-L1, S exhibits a bimodal-like distribution, concentrated near the two ends of the region, while N is more broadly distributed near the end of the region. Several other amino acids also exhibit position-specific association. For example, R and Q, when present, are mostly at LV24 and LV27, while Y is mainly at LV32. The acidic and basic amino acids altogether account for only about 10% of the total amino acid usage. This low frequency of charged polar residues is also prevalent in CDR-L2, CDR-L3, and CDR-H1.
Taken together, the position clustering dendrogram (Fig. 3a) and the frequency chart show that CDR-L1 contains two polar patches that heavily utilize amino acids with hydroxyl and amide side chains. The first one is at LV26-27 near the beginning of the CDR, where S and Q predominate. With the exception of heavy A usage at LV25, the first five residues in CDR-L1 are polar in nature. 
When there are insertions at LV27A, almost all entries contain S. To a large extent, these residues represent the motif RASQ(S) that spans LV24-27 and 27A. LV30, which also has a heavy S usage, forms a hydrophilic cluster together with LV27A. The residues involved with the motif predominate in the left hand side of the position
dendrogram where amino acids with general heavy usage tend to aggregate. The second polar patch at LV30-32 is less obvious in the dendrogram. Here, the residues are highly represented by S and Y, which contain a hydroxyl side chain, and N, which has an amide side group.
Two positions in CDR-L1 exhibit extensive usage of nonpolar amino acids: LV25, consisting mainly of A and G, and LV33, which is predominantly L. In the position dendrogram, LV25 and LV34 form a cluster because of their similar heavy usage of A. LV25 is a member of the RASQ(S) motif. While LV34 varies in its hydrophilicity, it contains mainly A when it is occupied by a hydrophobic residue. The multiple small clusters are attributed to the lightly populated insertion positions, LV27B-F.
The amino acid dendrogram (Fig. 3d) lacks any significant clusters. The tree essentially represents the rank order usage in the region. The amino acids on the left of the dendrogram, from S to Y, are utilized often, and interestingly their occurrence, judging from the site-specific frequencies, is concentrated at a select few positions. In the right half, the branches from V to C consist of residues that are not used much. These under-represented residues are a mix of polar and nonpolar residues that have no discernible pattern. The most often used residue S is a close neighbor of the other amino acids in the RASQ(S) motif, which explains why A appears oddly to be associated with the hydrophilic residues.
3.1.2. CDR-L2
There are a few noteworthy features in the frequency chart of the relatively short CDR-L2 (Fig. 1b). Of the two more utilized hydrophobic residues, A has a bimodal-like distribution due to its usage near the two ends (LV50-51 and LV55), while L occurs mainly at LV54. Other hydrophobic amino acids, including V, I, P, W, and F, are hardly present in CDR-L2. As is the case with CDR-L1, C and M are absent. Similar to CDR-L1, hydrophobic residues account for approximately a third of the amino acids used in CDR-L2. 
Of the hydrophilic residues, S has a relatively broad distribution but with site-specific associations at LV52 and LV56. Other residues exhibiting position-specific association include T at LV53 and LV56, Q at LV55, and R at LV54, while N is skewed toward the front end. While S is common, Y is barely present.
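The position-specific frequencies discussed in these subsections can be illustrated with a minimal Python sketch. It assumes the CDR sequences have already been aligned to a common numbering so that every string has the same length; the function name `position_frequencies` and the toy sequences are hypothetical and are not taken from the dataset or from the scripts used in this work.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def position_frequencies(aligned_cdrs):
    """Return one {amino_acid: frequency} dict per position.

    Frequencies are normalized per position, so they sum to one at every
    populated position. Symbols outside the 20 standard amino acids
    (for example a gap character) are ignored.
    """
    n_positions = len(aligned_cdrs[0])
    table = []
    for j in range(n_positions):
        counts = Counter(seq[j] for seq in aligned_cdrs if seq[j] in AMINO_ACIDS)
        total = sum(counts.values())
        table.append({a: (counts[a] / total if total else 0.0) for a in AMINO_ACIDS})
    return table

# Hypothetical toy sequences, for illustration only.
toy = ["RASQSIS", "RASQGIS", "QASQSVS", "RSSQSIS"]
freqs = position_frequencies(toy)
```

Each entry of `freqs` corresponds to one column of a frequency chart such as Fig. 1; summing an amino acid over all positions and renormalizing would instead give the general usage within the CDR.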
Fig. 3. Position dendrograms (a–c) resulting from hierarchical cluster analysis using average linkage and Euclidean distance, and amino acid dendrograms (d–f) for each CDR in the light chain.
In the position dendrogram (Fig. 3b), LV52-53 and LV56 form a polar cluster. These positions are occupied primarily by S, T, and N. LV53 and LV56 form a subgroup because of their similar heavy usage of S and T. Although residues LV50 and LV55 vary in degree of hydrophilicity, they form a cluster as a result of comparable usage of A. Among other residues, LV51 is nonpolar in nature because of its heavy usage of A and other hydrophobic amino acids.
The highly abundant amino acids S, A, L, R, and T appear on the left side of the amino acid dendrogram (Fig. 3e), and they occupy mostly a small number of selected positions. The branches from V to I consist of residues that are not used much. This pattern is very similar to that in CDR-L1. Again, the cluster of under-represented residues is a mix of both polar and nonpolar residues. There are three small clusters that are due to similar amino acid usage at positions that vary in hydrophilicity. First, hydrophobic L and hydrophilic R form a cluster because of their comparable usage at LV54, imparting a variable character to this position. Likewise, E and P form a cluster due to their comparable usage at LV55. Finally, hydrophobic G and acidic D form a small cluster because of their similar usage at LV50.
3.1.3. CDR-L3
As expected, this region is quite variable (Fig. 1c). Hydrophobic residues G, L, and W are broadly distributed. However, P appears mostly at LV95, followed by LV95A. LV95 also has a high usage of L, making this position nonpolar. Other hydrophobic amino acids, including A, V, I, L, F, C, and M, hardly occur in this region. Similar to both CDR-L1 and CDR-L2, nonpolar amino acids constitute approximately a third of the residues used in CDR-L3.
Among hydrophilic amino acids, Q is most common and is present heavily at LV89 and LV90. This shared usage is why these two positions form a polar cluster (Fig. 3c). Other amino acids including N, Y, and S are distributed more broadly, while S also exhibits a heavy usage among LV90-94. 
Similarly T exhibits position-specific association at LV97. Overall, there is one large polar patch that extends from LV89 to LV93. All probable insertion positions LV95A-F are lightly or not populated and constitute the multiple small clusters.
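How positions with similar amino acid usage end up in the same dendrogram branch can be seen in a small sketch of agglomerative clustering with average linkage, Euclidean distance, and the Lance–Williams update named in Section 2. This is a plain-Python illustration rather than the R routine used in this work; the function name `average_linkage`, the position labels, and the three-component usage profiles are hypothetical.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def average_linkage(profiles):
    """Agglomerative clustering with average linkage.

    profiles: dict mapping a label to a list of frequencies.
    Returns the merges as (members_a, members_b, distance), in merge order.
    """
    clusters = {i: (label,) for i, label in enumerate(profiles)}
    vectors = list(profiles.values())
    # Pairwise distances, keyed by (smaller_id, larger_id).
    dist = {(i, j): euclidean(vectors[i], vectors[j])
            for i in clusters for j in clusters if i < j}
    merges = []
    next_id = len(clusters)
    while len(clusters) > 1:
        (i, j), d = min(dist.items(), key=lambda kv: kv[1])
        ni, nj = len(clusters[i]), len(clusters[j])
        merges.append((clusters[i], clusters[j], d))
        # Lance-Williams update for average linkage:
        # d(k, i+j) = (n_i * d(k, i) + n_j * d(k, j)) / (n_i + n_j)
        updated = {}
        for k in clusters:
            if k in (i, j):
                continue
            dki = dist[(min(k, i), max(k, i))]
            dkj = dist[(min(k, j), max(k, j))]
            updated[(k, next_id)] = (ni * dki + nj * dkj) / (ni + nj)
        dist = {p: v for p, v in dist.items() if i not in p and j not in p}
        dist.update(updated)
        clusters[next_id] = clusters.pop(i) + clusters.pop(j)
        next_id += 1
    return merges

# Hypothetical usage profiles over three amino acids (e.g. S, T, A).
profiles = {
    "p1": [0.8, 0.1, 0.1],
    "p2": [0.5, 0.4, 0.1],
    "p3": [0.5, 0.5, 0.0],
    "p4": [0.0, 0.1, 0.9],
}
merges = average_linkage(profiles)
```

In this toy input the two positions with nearly identical profiles (`p2` and `p3`) merge first, and the position dominated by a different amino acid (`p4`) joins last. Clustering by amino acids instead of by positions amounts to running the same procedure on the transposed frequency table.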
In the amino acid dendrogram (Fig. 3f), the abundant amino acids at the left of the dendrogram, from Q to Y, are localized at relatively few positions. There are two notable clusters. First, W and L form a nonpolar cluster because of their similar presence at LV96. Second, amino acids N, G, and D form a cluster because of their comparable usage at LV92. The branches from H to K consist of residues that are rarely used.
3.1.4. CDR-H1
With respect to the position-specific frequencies (Fig. 2a), the hydrophobic residues G, F, and M are utilized more often in CDR-H1. Other hydrophobic amino acids, including V, P, W, and C, are hardly used. Residue G is broadly distributed with strong site-specific association at HV26, and also at HV35B when there are insertions. The distribution of F is skewed toward the first half of the region, while M exhibits site-specific association at HV34. Just over half of the amino acids used in this region are polar. Among hydrophilic residues, S, T and Y are broadly distributed, but H, the sole prominent charged residue, exhibits position-specific behavior at HV35.
There is a polar patch at HV30-HV32 that heavily utilizes amino acids with hydroxyl side chains. HV30 and HV31 mainly consist of S and form a subgroup (Fig. 4a), while HV32 predominantly utilizes Y. Other predominantly polar positions include HV28, with a heavy usage of T, and HV35. A nonpolar patch is apparent at HV26-HV27, where HV26 almost always contains G, and F predominates at HV27. HV29, which also consists largely of F, forms a cluster with HV27. Interestingly, at HV27, Y, another aromatic residue, is used about half as much as F, which could possibly be seen as a substitution serving both structural and binding purposes. Among other positions, HV34 is nonpolar; it is taken up by I and the rarely used M. As a result, I and M form a nonpolar group in the amino acid dendrogram (Fig. 4d). 
Insertion positions HV35A and HV35B form a cluster because they are lightly populated, and HV33 is a close neighbor with them because HV33 is completely dissimilar to the other positions due to its high variability. Residues HV35, HV30, and HV31 form a hydrogen bonding cluster as a result of comparable S usage.
Among the amino acid clusters, S and T form a hydrophilic group that is close in distance to Y, primarily because of their broad distribution throughout CDR-H1. As noted above, the often utilized G, F, and Y are grouped on the left side of the dendrogram. The branches from V to Q consist of residues that are barely used in CDR-H1.
3.1.5. CDR-H2
There is significant variability in the relatively long CDR-H2 (Fig. 2b). Among hydrophobic residues, G, A, V, I, and F are common and broadly distributed. Nonetheless, some of them exhibit position-specific behavior: G at HV65, A at HV60, V at HV63, and F at HV63. Other residues, including M, are barely present and C is absent. Nonpolar amino acids constitute roughly 40% of the residues used in this region.
Among hydrophilic residues, S is broadly distributed, and N is weighted more in the first half of the region. A few other residues exhibit position-specific behavior: Q is present mostly at HV61 and HV64, T predominantly at HV57, and Y at HV59. As for the acidic amino acids, D is fairly broadly distributed, but E is barely used. For the basic amino acids, K is utilized more often in the second half of the region, while H and R are seldom used. Close to a tenth of the amino acids in CDR-H2 are basic.
There are two apparent polar patches. The first one is from HV56-HV59, where T and Y predominate. In fact, HV59 is occupied by Y in 95% of the records. The second polar patch appears in HV61-HV62. Interestingly, HV61 contains either the acidic residue D or Q, which has an amide side chain, while HV62 contains either the basic residue K or the hydrogen bonding S. Except for HV51, the residues in the front half of CDR-H2 (HV50-55) form small cluster groups that are variable in hydrophilicity (Fig. 4b). The three distinctly nonpolar positions are HV51, which mainly contains I, HV63 because of its usage of V and F, and HV65 with its heavy G usage. Residues HV65, HV54 and HV55 form a cluster based on their similar usage of G. 
Residues HV52 and HV56 form a cluster because of their similar pattern of amino acid usage. All the probable insertion residues HV52A-HV52C are lightly populated and form a cluster with HV50 because its variable usage of amino acids is dissimilar to any other positions.
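The polar, nonpolar, and variable labels used for positions in this section can be mimicked by a simple rule on the position-specific frequencies. The sketch below is only an illustration: the hydrophobic grouping follows the residues treated as hydrophobic in the text (A, I, L, V, G, P, W, F, M, C), whereas the function name `classify_position` and the 0.7 cutoff are assumptions, since no numerical criterion is stated in this work.

```python
# Residues discussed as hydrophobic in the text; all others are treated as hydrophilic.
HYDROPHOBIC = set("AILVGPWFMC")

def classify_position(freqs, threshold=0.7):
    """Label a position 'P' (polar), 'N' (nonpolar), or 'variable'.

    freqs: dict mapping amino acid letters to position-specific frequencies.
    threshold: hypothetical cutoff on the summed usage of one class.
    """
    nonpolar = sum(f for a, f in freqs.items() if a in HYDROPHOBIC)
    polar = sum(f for a, f in freqs.items() if a not in HYDROPHOBIC)
    if polar >= threshold:
        return "P"
    if nonpolar >= threshold:
        return "N"
    return "variable"

# Hypothetical usage profiles, for illustration only.
mostly_polar = {"S": 0.8, "A": 0.2}
mostly_nonpolar = {"A": 0.6, "G": 0.3, "S": 0.1}
mixed = {"L": 0.5, "R": 0.5}
labels = [classify_position(p) for p in (mostly_polar, mostly_nonpolar, mixed)]
```

A position split evenly between a hydrophobic and a hydrophilic residue, as described for LV54 in CDR-L2, falls into the variable class under this rule.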
There are two noteworthy amino acid clusters (Fig. 4e). Nonpolar V and F form a cluster because of their comparable usage at HV63. In addition, D and Q form a cluster due to their similar utilization at HV61. The amino acids on the left of the dendrogram, from G to T, are used extensively in the region. The branches from L to C consist of residues that are under-represented.
3.1.6. CDR-H3
All amino acids are represented in the highly variable CDR-H3, with polar and nonpolar residues fairly evenly divided (Fig. 2c). Among hydrophobic amino acids, G, A, V, L, P, and F are utilized more often and appear to be broadly distributed. However, C barely occurs. Most records in the dataset vary in length between 10 and 14 residues, with very few entries having insertions from 100H to 100J. Among hydrophilic amino acids, S and Y are used extensively, and Y is especially heavy at HV102. However, N and Q are utilized much less, possibly indicating a lesser reliance on amide interactions in CDR-H3. Of the acidic amino acids, D is skewed toward the beginning of the region and exhibits site-specific association at HV101. Consequently, roughly a third of the amino acids appearing in CDR-H3 are acidic. Furthermore, R is scattered throughout the region, while the other basic amino acids, H and K, are not utilized to a significant extent.
In the position clustering dendrogram (Fig. 4c), HV101, HV102, and HV95 are singled out on the left side because a handful of amino acids demonstrate moderate position-specific behavior, especially D appearing at HV101, making it the only predominantly polar position in CDR-H3. Positions HV96-HV98 form a cluster because of their similar overall usage of amino acids. The rest of the dendrogram, which mainly involves insertion positions, is much too variable to extract meaningful patterns. Residues I and V form a nonpolar cluster (Fig. 4f) because of their similar appearance at HV102, which may reflect their structural similarity. 
Moreover, P and L form another nonpolar cluster due to a slightly comparable usage at various positions throughout the region. Amino acids D, Y, and G appear at the far left of the dendrogram because of their extensive usage, and again, D specifically at HV101. Residues W through
S. Culler et al. / BioSystems 77 (2004) 195–212 Fig. 4. Position dendrograms (a–c) resulting from hierarchical cluster analysis using average linkage and euclidean distance, and amino acid dendrograms (d–f) for each CDR in the heavy chain.
203
204
S. Culler et al. / BioSystems 77 (2004) 195–212
Q appear at the right of the dendrogram because of their relatively light usage. 3.2. Entropy profiles Site entropy is a measure of amino acid substitutions that can be made at each residue without disrupting the structure, and in the CDRs may also reflect plasticity to antigen driven selection. This statistical free energy quantifies probable selection at a position: the deviation of the observed distribution from randomness. Entropy analysis could be used to identify beneficial mutations and guide mutagenesis experiments (Voigt et al., 2001). In the framework region, the amino acid pattern is important for the stability of the immunoglobulin variable domain (Ewert et al., 2003). Variability in the CDRs reflects more strongly the selection for antigen recognition, and high entropy positions may include probable contact points and neighborhood residues. It is known that improving the affinity of antibodies should include residues other than those directly interacting with the antigen (Boder et al., 2000; Marvin and Lowman, 2003; Ramirez-Benitez and Almagro, 2001). In this work, information entropy was computed for each position in both the light and heavy chains (Figs. 5 and 6). Values in the CDRs are also included in Table 1 for later discussions. Lines representing one standard deviation from the antibody sample mean are added as a rough visual guide; they should not be in-
terpreted as criteria for defining high or low entropies or absolute targets for site-directed mutagenesis. Generally, the CDR regions are more susceptible to substitution as expected, but one cannot ignore the fact that residues in the framework region may have relatively high entropy values. Partly, the framework regions are also subject to somatic hypermutation, but more importantly, the variability shown here is a result of combining many sequences evolved from different germline subgroups. Hence the entropy profiles for the germline sequences and a sample set of clones within one subgroup are also calculated in each case and included in the plots for comparison. Within a clone set, the fluctuations are damped significantly, while the germline entropy reflects variability in different subgroups. It is notable that the fluctuations in all the profiles are essentially in phase, pointing to the scenario that structural and recognition residues retain their roles through the B cell maturation and selection process. Along a given profile, conserved positions obviously have low entropy values, but these values may also be artifacts introduced by those lightly populated insertion sequence positions that significantly reduce the probability of amino acid substitutions. Examples are the regions LV27A-27F in CDR-L1, LV95C-95F in CDR-L3, HV35A-35B in CDR-H1, HV52B-52C in CDR-H2, and HV100G-100R in CDR-H3. The same observation can be made with the profiles from the germlines and the clone sample set. In the presenta-
Fig. 5. The sequence entropy profile for light chain antibodies (bold line), V and VL germline sequences (thin), and clone dataset (dash). Each CDR region is highlighted in gray. Sequence number is used here instead of the Kabat numbering for labeling convenience.
S. Culler et al. / BioSystems 77 (2004) 195–212
205
Fig. 6. The sequence entropy profile for heavy chain antibodies (bold line), VH germline sequences (thin), and heavy chain clone dataset (dash). Each CDR region is highlighted in gray. Sequence number is used here instead of the Kabat numbering for labeling convenience. Table 1 Summary of characteristic features of the light and heavy chain CDR positions Contact site CDR-L1 24 25 26 27 27A 27B-F omitted 28 29 30 31 32 33 34
Entropy
Contact Indexa
Hb
1.33 1.14 0.76 1.23 0.80
1.8 0.9 6.4 3.7
P N P P P
2.05 1.96 1.87 1.87 1.59 0.95 1.79
7.3 3.7 1.8 5.5 1.8 3.7
CDR-L2 50 51 52 53 54 55 56
2.15 1.42 0.92 1.88 1.23 1.81 1.11
0.9
CDR-L3 89 90 91 92 93 94 95 95A
1.47 1.32 2.12 2.24 1.88 2.25 1.30 1.17
4.6 9.2 8.3 5.5 7.3 5.5 4.6 2.8
P P P N
Ac
SDd
+ +
+
+ + + + +
+ + + + +
0.9
N P P
Se
+ + + +
+
+
+ +
N
+ + +
+
+
+ + +
P + +
+ + + +
+
+ +
P P P P P
Cf
+
+
+ +
+
+ + + +
206
S. Culler et al. / BioSystems 77 (2004) 195–212
Table 1 (Continued ) Contact site
Entropy
Contact Indexa
95B-F omitted 96 97
2.35 1.13
0.9
CDR-H1 26 27 28 29 30 31 32 33 34 35 35A-B omitted
0.52 1.75 1.29 1.14 1.61 1.99 1.65 2.39 1.76 1.83
Hb
Ac
P
1.5 0.7 0.7 2.2 1.5 11 2.2 11
CDR-H2 50 51 52 52A
2.37 1.17 2.31 2.24
10.4 0.7 5.2 0.7
52B-C omitted 53 54 55 56 57 58 59 60 61 62 63 64 65
2.46 1.86 1.54 2.38 1.73 2.12 0.84 1.41 1.68 1.43 1.38 1.31 0.87
3.7 0.7 0.7 2.2 3.0 3.7
CDR-H3 95 96 97 98 99 100 100A
2.36 2.75 2.67 2.67 2.74 2.61 2.60
6.7 14 4.4 3.7 4.4 1.5 0.7
100B-R omitted 101 102
1.33 2.00
1.5 0.7
N N P N P P P
SDd
Se
Cf
+ +
+
+
+ + + + + + +
N P
+ + + +
+ +
+ N
P P P P
+
+ +
+ + + +
+ +
+
P P N P N
+
+ + + + + + P
+
+ +
+
+
+ + + + + + + + +
+
a Percent occurrence as contact residues in the PDB dataset using LPC. Blank means zero contact. Total may not sum to 100 because of omitted insertion positions in the table entries. b General hydrophilicity: P = polar; N = nonpolar; blank = variable. c Solvent accessible residues (Chothia and Lesk, 1987). d Specificity-determining residues based on structural variability (Padlan et al., 1995). e Structural residues (Chothia and Lesk, 1987). f Conserved positions (Kabat et al., 1977).
tion below, low values at the insertion positions are generally omitted. The attention is on residues that may have implications in structural stability or antigen recognition.

3.2.1. CDR-L1
Along the light chain antibody profile (Fig. 5), the low entropy at positions LV26 and LV33 is mainly due to heavy usage of S and L, respectively. Both positions appear to have similar amino acid usage in the germline sequences. Other low entropy positions (LV24, LV25, LV27) generally rely heavily on a few amino acids. Residues LV28-LV32 and LV34 have diversified amino acid usage, which is reflected in their high entropy values. For example, as many as 10 different amino acids are used at LV28. Within the germline profile, LV24-LV25 and LV27 are high entropy positions where one type of residue is used in about half of the sequences and many different amino acids appear in the other half. Specifically, R at LV24, A at LV25, and Q at LV27 each appear in about half of the records. It appears that many antibodies have inherited these residues from the germlines. The highly variable region LV30-LV32 is a polar patch that is characterized by the high prevalence of hydrophilic amino acids. Interestingly, the high entropy positions for the clone data set include LV28 and LV30-32, which are also the highest entropy positions in the antibodies profile. These positions are variable in the clone sequences because the amino acid substitutions can range from three to four for these six entries.

3.2.2. CDR-L2
Residue LV52 has the lowest entropy within CDR-L2 because this position is occupied mainly by S, as is the case in the germlines, where the percentage of S is even higher. With the exception of LV56, the clone sample set has little substitution and thus a low entropy profile in CDR-L2. Residues LV54 and LV56 have fairly average entropy values because a few amino acids predominate at these positions.
At LV54, either L or R is used heavily in the antibody sequences and germlines, and at LV56, either S or T appears, while S is more predominant in the germline sequences. Positions LV50, LV53, and LV55 all have high entropy values in both antibody sequences and the
germlines. At LV50, both antibody sequences and the germlines utilize some 14 different amino acids, and from the antibody position dendrogram, this position has been characterized as variable in hydrophilicity. Likewise, LV55 cannot be characterized as either hydrophobic or hydrophilic. Even though LV53 contains over 10 different types of amino acids, about 96% of them are hydrophilic. Among the germlines, LV51 has relatively high entropy because several different amino acids are used, but A usage is predominant here in the antibody sequences.

3.2.3. CDR-L3
As expected, the most variable region is CDR-L3. Except for the lightly populated insertion positions, the lowest entropy position is LV97, and it has a slightly above average value due to its extensive usage of T and light utilization of about five other residues. Positions LV91-LV94 and LV96 have extremely high entropy, among the highest of the entire light chain. LV91 and LV92 are a part of the polar patch identified earlier and they utilize more than 12 different amino acids. Both LV94 and LV96 utilize over 14 different amino acids, consistent with being identified as variable in hydrophilicity in the position dendrogram.

3.2.4. CDR-H1
Along the heavy chain (Fig. 6), the extraordinarily low entropy regions are again due to the barely populated insertion sequence numbers. Notably, HV26 has low entropy. Here, G occurs in about 98% of the antibody sequences, and all of the germlines begin with G in CDR-H1. Among antibodies, HV33 has the highest entropy in CDR-H1 as a result of utilizing some 17 different amino acids. As mentioned in the cluster analysis, HV33 is grouped by the algorithm with the barely populated HV35A and HV35B because their amino acid usages are so dissimilar to others in the region.
On the other hand, even though HV31 and HV35 utilize more than 11 different residues, the usage has a high frequency of polar amino acids and they form a hydrophilic cluster with HV30 in the position dendrogram. Among the high entropy positions, HV30 and HV31 in the germlines are populated mainly by one or two different residues and have lower entropy than the antibodies profile, suggesting that these positions may
be frequent targets of somatic hypermutation. Even within the chosen clone set example, HV31 has three amino acid substitutions.

3.2.5. CDR-H2
This region is quite variable, consistent with the patterns in the cluster analysis. Disregarding the insertion sequence numbers, HV59 has the lowest entropy because, as mentioned earlier, this position is almost always occupied by Y in both antibodies and germlines. Among the high entropy positions, HV50, HV52, HV52A, and HV53 are in the variable patch identified in the position dendrogram. Similarly, HV56 and HV58 are within the polar patch identified in the clustering. The high entropy positions are also reflected in the germlines, but the germlines also have relatively high entropy values at HV60 and HV63. The profile of the clone sample set follows that of the antibodies closely. All substitutions among the clones occur in the high entropy regions.

3.2.6. CDR-H3
The entropy reflects that this is a hypervariable region, especially at HV95-HV100F and HV102. Residue HV101 has slightly lower entropy than other CDR-H3 positions because of its extensive usage of D. Germline heavy sequences do not have a CDR-H3 region. The entropy values within the clone sample set are much lower, but the profile exhibits a similar trend when compared with the antibodies.
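The entropy comparisons in Section 3.2 can be illustrated with a minimal sketch. It assumes equal-length aligned sequences, ignores gap characters, and uses the natural logarithm; the logarithm base, the toy sequences, the 0.3 difference threshold, and the function names are illustrative assumptions, not the dataset or settings used in this work.

```python
import math
from collections import Counter

def position_entropy(column, base=math.e):
    """Shannon entropy H = -sum(p_i * log p_i) over the residues in one alignment column."""
    residues = [r for r in column if r != "-"]  # assumption: gaps are ignored
    if not residues:
        return 0.0
    total = len(residues)
    counts = Counter(residues)
    return -sum((n / total) * math.log(n / total, base) for n in counts.values())

def entropy_profile(sequences, base=math.e):
    """Entropy at every column of a set of equal-length aligned sequences."""
    return [position_entropy(col, base) for col in zip(*sequences)]

# Hypothetical toy alignments (four positions each), not data from the paper.
antibodies = ["GFTS", "GYTN", "GFSD", "GYTS"]
germlines = ["GFTS", "GFTS", "GYTS", "GFTS"]

antibody_profile = entropy_profile(antibodies)
germline_profile = entropy_profile(germlines)

# Positions where the antibody entropy exceeds the germline entropy by more
# than an arbitrary threshold; such positions would be candidates for
# variability acquired after the germline stage.
diverged = [
    i for i, (a, g) in enumerate(zip(antibody_profile, germline_profile))
    if a - g > 0.3
]
print(diverged)
```

A fully conserved column, like the first one here, gives zero entropy, while a column split evenly between two residues gives log 2 in the chosen base.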
4. Discussion
By using position-specific frequencies and cluster analysis, we can recognize site-specific patterns leading to the observation of hydrophobic and hydrophilic patches. Together with information entropy profiles, the present approach may be used to illuminate the antibody selection process. The high entropy positions that deviate from the germlines may be a consequence of the immune recognition process. Mutations are focused at “hot-spots” which tend to occur in antigen binding loops and in particular the first hypervariable loop of human VH and V genes (Winter, 1998). The probability that beneficial mutants could be found increases when high entropy positions are targeted. Information from the entropy profile can be incorporated in experimental designs. Site saturation mutagenesis can be applied with more discretion at positions that are predicted to be the most tolerant, or here, more crucial in antigen targeting.

An important question is whether insights from an analysis that is based on statistical patterns of sequence information may have relevance in practice. To this end, we need to compare the present results with analyses based on structural data. As early as 1977, Kabat et al. determined that there are conserved positions in the light and heavy chain CDR regions that function as important structural elements. Structural residues are correlated with positions of low variability and with positions defining the antibody canonical classes (Chothia and Lesk, 1987). In contrast, structural analysis also revealed that residues that help to determine or influence specificity tend to be highly variable (Padlan et al., 1995). We shall attempt to show that similar observations could be reached without the use of atomic coordinate data.

To facilitate the discussion, the important properties of the CDR residues are summarized in Table 1. Also included in the table are some literature results and our contact analysis using PDB antibody–antigen records and the ligand-protein contacts (LPC) program. The frequency of contacts in the CDRs from the dataset is expressed as a percent contact index. A logical expectation is that a residue that makes frequent contacts, as in LV30-32, also has high entropy, is polar, and has been considered in previous studies as solvent accessible and variable. From Table 1, it is clear that there are many exceptions. Among other reasons, conserved positions are not precluded from binding and nonpolar residues can be important in antigen binding (Burks et al., 1997). In the following discussion, the more nominal patterns in the table will not be identified repeatedly.

4.1. CDR-L1
It appears that position-specific frequency correlates well with canonical analysis of V structures resulting from the significant number of sequences in the IMGT dataset. Of the 13 conserved positions in the light chain first identified by Kabat et al. (1977), 6 of them, LV24-LV27, LV29, and LV33, are in CDR-L1. Except for LV29, all of these positions do have low entropy values. Despite having been identified as conserved, LV27 has a relatively high contact index
among CDR-L1 residues. The RASQ(S) motif beginning at LV24 is mostly a consequence of structurally conserved positions. The presence of R at LV24 may contribute to the plasticity of the binding site with its flexible side chain (Mian et al., 1991). From the site-specific frequency, S exhibits a somewhat bimodal distribution. This observation can be explained by the notion that, in both Vκ and Vλ germlines, AGY serine residues are highly mutated in the CDR regions during the evolution of the germline repertoire. As a consequence, S residues are located at the peripheral edges of the binding site (Ignatovich et al., 1997). The highest entropy position in CDR-L1, LV28, also varies in hydrophilicity, likely a consequence of the many contacts that it makes. Most of the high entropy positions in this region also have a propensity to make contacts. One lone exception is LV34; it does not have any contacts. It likely plays a more important role at the HV–HL interface (Vargas-Madrazo and Paz-Garcia, 2003). In contrast, LV33, which is occupied predominantly by L and considered to be a buried and structurally conserved residue, makes frequent contacts. From examination of PDB records, it could undergo hypermutation to acquire a polar character and participate in binding. Among the probable antigen contact regions, LV30 and LV31 in the polar patch at LV30-32 are highly variable in both germline and antibody sequences and are also considered sites of significant somatic hypermutation (Tomlinson et al., 1996).

4.2. CDR-L2
A striking feature is the sparse contacts made by CDR-L2, consistent with the results of MacCallum et al. (1996). Even so, the two high entropy positions, LV50 and LV53, do make modest contacts. In particular, LV50 was identified as often being in direct contact with antigens (Ramirez-Benitez and Almagro, 2001).
These positions also have high entropy in the germline sequences, consistent with the classification of LV50 as a diverse position in V germline sequences and of LV53 as a position with frequent somatic hypermutation (Tomlinson et al., 1996). The low entropy positions all appear to have been previously identified as conserved residues. The polar nature of
LV52 could be a result of the hydrogen bond that it forms with LV49 (Chothia and Lesk, 1987). LV54 contains mainly L or R. These residues may serve to increase the plasticity of the antigen binding region.

4.3. CDR-L3
The LV89-95 segment has high contact indexes. They include low entropy positions, LV90 and LV95, that were considered as conserved and canonical, and on the other hand, LV91 and LV93 that were not considered as variable and specificity-determining. Nonetheless, results here are consistent with observations that LV91-LV94 often are in direct contact with the antigen (Ramirez-Benitez and Almagro, 2001), and all these highly variable positions coincide with the probable contact region (MacCallum et al., 1996). The region LV89-93 is predominantly polar, frequently occupied by residues that can participate in hydrogen bonding. However, LV95 is distinctive in being occupied mostly by P and L. The presence of P is likely inherited from germline genes where it also predominates. Residues LV91-LV93 are also highly variable in the germline profile, which is consistent with the identification that LV91 and LV92 are diverse positions in the V germline repertoire and that LV93 is a position likely to undergo somatic hypermutation (Tomlinson et al., 1996). Interpretation of the germline entropy is difficult because nearly all of the germline CDR-L3 sequences are seven residues long and quite different from the much more variable antibody sequences. LV91 and LV96, which form a cluster in the position dendrogram, both also participate in HV–HL interface interactions (Vargas-Madrazo and Paz-Garcia, 2003). Even though the presence of the bulky Q at LV89 may prevent LV96 from orienting toward the inner layer, its low antigen contact tendency suggests that LV96 plays a more important role in mediating binding indirectly via the HV–HL interface.

4.4. CDR-H1
Regardless of former classifications as conserved or specificity-determining, the high entropy positions HV27 and HV31-HV35 also have high contact indexes, especially so at HV33 and HV35. The reduced contact at HV28 and HV30 is possibly tied to the
nonpolar HV29, which may be buried within the framework structure, packing against the side chain of residue HV35A (if present) and the main chain of HV72 and HV77 (Chothia and Lesk, 1987). In contrast, HV27, although mainly nonpolar, possibly lying in a surface cavity next to HV94, and formerly not considered within the antigen contact region (MacCallum et al., 1996), participates in a fair number of antigen interactions. Despite being variable, HV34 is considered structurally conserved in CDR-H1. It is occupied fairly often by M and I. The usage of M, with its long side chain, may serve to impart plasticity to the region. HV35 is unique. It plays a role at the HV–HL interface, but by pointing toward the antigen binding site, it also participates in a significant number of antigen interactions (Vargas-Madrazo and Paz-Garcia, 2003). Residues HV31 and HV33 also have high entropy values in the germline sequences, complementing the classification of HV33 as a diverse position in the germline VH repertoire and HV31 as a position that often undergoes somatic hypermutation (Tomlinson et al., 1996).
4.5. CDR-H2
The contact region in CDR-H2 is localized at the front end HV50-58, consistent with the analysis of MacCallum et al. (1996). Among these positions, the higher entropy ones (HV52, HV53, HV56, HV58) tend to have a higher contact index, and may have contributed to the same tendency at HV57. In contrast, the nonpolar and buried HV51 makes less contact. Similarly, the two canonical positions in this region, HV54 and HV55, also have slightly lower entropy and contact indexes. These two positions form a cluster in the position dendrogram due to their similar usage of G and S, which consequently define canonical classes 3 and 4 for CDR-H2. The variability of this region is also seen in germline VH sequences, where it is the greatest among all CDR regions, including those of the light chain. Positions HV56 and HV58 are also high entropy positions in the germline profile, which complements the high probability that both will undergo somatic hypermutation (Tomlinson et al., 1996).
4.6. CDR-H3
This region occupies a central position in the binding site, and functionally, CDR-H3 plays a distinct role in determining antibody specificity. It is much more variable in length and sequence than the other antigen-binding loops. Several mechanisms contribute to its diversity, including selection of HV, D, and J gene segments and alternative splicing (Morea et al., 1998). CDR-H3 also interacts with the VL domain as well as other parts of the HV domain (Vargas-Madrazo and Paz-Garcia, 2003). Most of the residues in CDR-H3 may make antigen contacts, with HV96 having the highest contact index. The contact index falls off at HV102, which previously was considered to be just beyond the antigen contact region (MacCallum et al., 1996). Despite the variability of CDR-H3, HV101 has low entropy, consistent with being classified as conserved and canonical. The conserved nature is due to its forming a salt bridge with the side chain of R at HV94. These two residues are critical for the “kinked” base (Shirai et al., 1996) or bulged torso (Morea et al., 1998) structure of the CDR-H3 loop. While HV95 may interact with the VL domain, it has also been suggested that its side chain can point upward toward the antigen binding site (Vargas-Madrazo and Paz-Garcia, 2003), thus accounting for the relatively high contact index at this position.
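The kind of cross-check between entropy and contact index made throughout this discussion can be sketched in a few lines. The example below uses the entropy and contact index entries as read from Table 1 for CDR-L3 positions 89 to 95A; the cutoffs (entropy below 1.4 for "conserved", entropy of at least 2.0 for "variable", contact index of at least 4.0 for "frequent contact") and the function name are arbitrary assumptions for illustration and are not thresholds used in this work.

```python
# (entropy, contact index) for CDR-L3 positions 89-95A, as read from Table 1.
cdr_l3 = {
    "89": (1.47, 4.6),
    "90": (1.32, 9.2),
    "91": (2.12, 8.3),
    "92": (2.24, 5.5),
    "93": (1.88, 7.3),
    "94": (2.25, 5.5),
    "95": (1.30, 4.6),
    "95A": (1.17, 2.8),
}

def select_positions(table, entropy_range, min_contact):
    """Positions whose entropy lies in [low, high) and whose contact index is >= min_contact."""
    low, high = entropy_range
    return [
        pos for pos, (entropy, contact) in table.items()
        if low <= entropy < high and contact >= min_contact
    ]

# Hypothetical cutoffs: "conserved" below 1.4, "variable" at or above 2.0.
conserved_but_contacting = select_positions(cdr_l3, (0.0, 1.4), min_contact=4.0)
variable_and_contacting = select_positions(cdr_l3, (2.0, float("inf")), min_contact=4.0)

print(conserved_but_contacting)
print(variable_and_contacting)
```

With these particular cutoffs, the first list picks out LV90 and LV95, the low entropy positions that the CDR-L3 discussion singles out as making frequent contacts, which shows why high entropy alone does not identify every contact position.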
5. Conclusions
We have shown how statistical patterns using sequence information can help to identify important variable residues that share similarities with results using structural data. However, our work is not meant to replace structural analyses. In fact, our results take on a meaningful interpretation only after a comparison with previous works. Hence, the present work is best viewed as a computational tool that complements structural analyses, especially in situations such as combinatorial studies where atomic coordinate data are not available. Our approach is more expedient than the approach of Lara-Ochoa et al. (1994), where a rank-order distribution has to be explored at each position.

To improve binding significantly, both CDR and framework residues need to be altered (Boder et al., 2000). It has also been observed that affinity maturation of antibodies with affinity in the low nanomolar range occurs most effectively via changes in the “vernier” residues rather than the contact residues (Foote and Winter, 1992; Ramirez-Benitez and Almagro, 2001; Vargas-Madrazo and Paz-Garcia, 2003). Thus, one may identify high entropy positions near antigen contact sites to improve antigen-binding affinity. In addition to narrowing the search space to more probable positions such as those with high entropy, one may take advantage of the position-specific frequencies and plan mutagenesis schemes that are biased toward the more often utilized residues. Further refinement will likely require molecular modeling of an antigen–antibody pair to assess the fitness landscape of binding improvement.
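As a rough illustration of the planning step suggested above, the sketch below ranks positions by entropy and then draws substitutions with probability proportional to position-specific frequencies. The entropy values are the CDR-H3 entries as read from Table 1; the residue frequencies, the number of draws, the random seed, and the function names are hypothetical and serve only to show the mechanics.

```python
import random

def pick_positions(entropy_by_position, top_n):
    """Return the top_n positions ranked by entropy, highest first."""
    return sorted(entropy_by_position, key=entropy_by_position.get, reverse=True)[:top_n]

def biased_draws(frequencies, n_draws, rng):
    """Draw residues with probability proportional to their observed frequency."""
    residues = list(frequencies)
    weights = [frequencies[r] for r in residues]
    return rng.choices(residues, weights=weights, k=n_draws)

# Entropy values for CDR-H3 positions, as read from Table 1.
h3_entropy = {
    "95": 2.36, "96": 2.75, "97": 2.67, "98": 2.67, "99": 2.74,
    "100": 2.61, "100A": 2.60, "101": 1.33, "102": 2.00,
}
targets = pick_positions(h3_entropy, top_n=3)

# Hypothetical position-specific frequencies (not taken from the paper);
# in practice each position would have its own distribution.
example_frequencies = {"G": 0.40, "S": 0.30, "Y": 0.20, "D": 0.10}

rng = random.Random(0)  # fixed seed so the sketch is repeatable
library = {pos: biased_draws(example_frequencies, 5, rng) for pos in targets}
print(targets)
print(library)
```

A scheme of this kind samples the frequently used residues more often, while rarely used residues still appear occasionally.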
Acknowledgements
The authors appreciate the support of the Rajko Medenica Research Foundation for this work.
References
Al-Lazikani, B., Lesk, A.M., Chothia, C., 1997. Standard conformations for the canonical structures of immunoglobulins. J. Mol. Biol. 273, 927–948.
Almagro, J.C., Dominguez-Martinez, V., Lara-Ochoa, F., Vargas-Madrazo, E., 1996. Structural repertoire in human VL pseudogenes of immunoglobulins: comparison with functional germline genes and amino acid sequences. Immunogenetics 43, 92–96.
Boder, E.T., Midelfort, K.S., Wittrup, K.D., 2000. Directed evolution of antibody fragments with monovalent femtomolar antigen-binding affinity. Proc. Natl. Acad. Sci. USA 97, 10701–10705.
Burks, E.A., Chen, G., Georgiou, G., Iverson, B.L., 1997. In vitro scanning saturation mutagenesis of an antibody binding pocket. Proc. Natl. Acad. Sci. USA 94, 412–417.
Chothia, C., Lesk, A.M., 1987. Canonical structures for the hypervariable regions of immunoglobulins. J. Mol. Biol. 196, 901–917.
Collis, A.V., Brouwer, A.P., Martin, A.C., 2003. Analysis of the antigen combining site: correlations between length and sequence composition of the hypervariable loops and the nature of the antigen. J. Mol. Biol. 325, 337–354.
Decanniere, K., Muyldermans, S., Wyns, L., 2000. Canonical antigen-binding loop structures in immunoglobulins: more structures, more canonical classes? J. Mol. Biol. 300, 83–91.
Ewert, S., Honegger, A., Pluckthun, A., 2003. Structure-based improvement of the biophysical properties of immunoglobulin VH domains with a generalizable approach. Biochemistry 42, 1517–1528.
Foote, J., Winter, G., 1992. Antibody framework residues affecting the conformation of the hypervariable loops. J. Mol. Biol. 224, 487–499.
Henikoff, J.G., Henikoff, S., 1996. Using substitution probabilities to improve position-specific scoring matrices. Comput. Appl. Biosci. 12, 135–143.
Ignatovich, O., Tomlinson, I.M., Jones, P.T., Winter, G., 1997. The creation of diversity in the human immunoglobulin V(lambda) repertoire. J. Mol. Biol. 268, 69–77.
Kirkham, P.M., Neri, D., Winter, G., 1999. Towards the design of an antibody that recognises a given protein epitope. J. Mol. Biol. 285, 909–915.
Knappik, A., Ge, L., Honegger, A., Pack, P., Fischer, M., Wellnhofer, G., Hoess, A., Wolle, J., Pluckthun, A., Virnekas, B., 2000. Fully synthetic human combinatorial antibody libraries (HuCAL) based on modular consensus frameworks and CDRs randomized with trinucleotides. J. Mol. Biol. 296, 57–86.
Lara-Ochoa, F., Vargas-Madrazo, E., Jimenez-Montano, M.A., Almagro, J.C., 1994. Patterns in the complementary determining regions of immunoglobulins (CDRs). Biosystems 32, 1–9.
MacCallum, R.M., Martin, A.C., Thornton, J.M., 1996. Antibody-antigen interactions: contact analysis and binding site topography. J. Mol. Biol. 262, 732–745.
Martin, A.C.R., 2001. Protein Sequence and Structure Analysis of Antibody Variable Domains. In: Kontermann, R., Dubel, S. (Eds.), Antibody Engineering (Springer Lab Manual). Springer Verlag, Berlin, pp. 422–439.
Marvin, J.S., Lowman, H.B., 2003. Redesigning an antibody fragment for faster association with its antigen. Biochemistry 42, 7077–7083.
Mian, I.S., Bradwell, A.R., Olson, A.J., 1991. Structure, function and properties of antibody binding sites. J. Mol. Biol. 217, 133–151.
Morea, V., Tramontano, A., Rustici, M., Chothia, C., Lesk, A.M., 1998. Conformations of the third hypervariable region in the VH domain of immunoglobulins. J. Mol. Biol. 275, 269–294.
Padlan, E.A., Abergel, C., Tipper, J.P., 1995. Identification of specificity-determining residues in antibodies. FASEB J. 9, 133–139.
Ramirez-Benitez, M.C., Almagro, J.C., 2001. Analysis of antibodies of known structure suggests a lack of correspondence between the residues in contact with the antigen and those modified by somatic hypermutation. Proteins 45, 199–206.
Shenkin, P.S., Erman, B., Mastrandrea, L.D., 1991. Information-theoretical entropy as a measure of sequence variability. Proteins 11, 297–313.
Shirai, H., Kidera, A., Nakamura, H., 1996. Structural classification of CDR-H3 in antibodies. FEBS Lett. 399, 1–8.
Sobolev, V., Sorokine, A., Prilusky, J., Abola, E.E., Edelman, M., 1999. Automated analysis of interatomic contacts in proteins. Bioinformatics 15, 327–332.
Steipe, B., Schiller, B., Pluckthun, A., Steinbacher, S., 1994. Sequence statistics reliably predict stabilizing mutations in a protein domain. J. Mol. Biol. 240, 188–192.
Tomlinson, I.M., Walter, G., Jones, P.T., Dear, P.H., Sonnhammer, E.L., Winter, G., 1996. The imprint of somatic hypermutation on the repertoire of human germline V genes. J. Mol. Biol. 256, 813–817.
Vargas-Madrazo, E., Paz-Garcia, E., 2003. An improved model of association for VH-VL immunoglobulin domains: asymmetries between VH and VL in the packing of some interface residues. J. Mol. Recognit. 16, 113–120.
Voigt, C.A., Mayo, S.L., Arnold, F.H., Wang, Z.G., 2001. Computationally focusing the directed evolution of proteins. J. Cell Biochem. Suppl. 37, 58–63.
Wedemayer, G.J., Patten, P.A., Wang, L.H., Schultz, P.G., Stevens, R.C., 1997. Structural insights into the evolution of an antibody combining site. Science 276, 1665–1669.
Winter, G., 1998. Synthetic human antibodies and a strategy for protein engineering. FEBS Lett. 430, 92–94.
Wu, T.T., Johnson, G., Kabat, E.A., 1993. Length distribution of CDRH3 in antibodies. Proteins 16, 1–7.