J. Mol. Biol. (1989) 207, 301-310
Consensus Methods for Finding and Ranking DNA Binding Sites
Application to Escherichia coli Promoters
Michael C. O’Neill
Department of Biological Sciences, University of Maryland Baltimore County, Baltimore, MD 21228, U.S.A.
(Received 25 July 1988, and in revised form 27 December 1988)
Many different approaches have been employed to define the “consensus” sequence of various DNA binding sites and to use the definition obtained to locate and rank members of a given sequence family. The analysis presented here enlists two of these approaches, each in modified form, to develop a highly efficient search protocol for Escherichia coli promoters and to provide a relative ranking of these sites that shows good agreement with in vitro measurements of promoter strength. Schneider et al. have applied Shannon’s index of information content to evaluate the significance of each position within the consensus of a family of aligned sequences. In a formal sense, this index is only applicable to a group of sequences, providing at each position a negative entropy value between zero (random) and two bits (total conservation of a single base) for sequences in which all bases are equally represented. A method for evaluating how well an individual sequence conforms to the information content pattern of the consensus is described. A function is derived, by analogy to the information content of the sequence family, for application to individual sequences. Since this function is a measure of conformity, it can be used in a search protocol to identify new members of the family represented by the consensus. A protocol for locating E. coli promoters is presented. The Berg-von Hippel statistical-mechanical function is also tested in a similar application. While the information content function provides a superior search protocol, the Berg-von Hippel function, when scaled at each position by the information content, does well at ranking promoters according to their strength as measured in vitro.
1. Introduction
Many different approaches have been employed in the development of search procedures for DNA binding sites. These include direct matching base by base, consensus frequency tables (Mulligan et al., 1984; Mulligan & McClure, 1986), frequency ratios (Harr et al., 1983), and various functions based on one or more of these (Staden, 1984; Studnicka, 1987), subject in each case to some cutoff criteria. Two approaches that have a theoretical, as opposed to ad hoc, basis are the application of information theory by Schneider et al. (1986) and the statistical-mechanical analogy of Berg & von Hippel (1987). I have modified each of these for use in a search protocol for Escherichia coli promoters. A modification of the Berg-von Hippel function, which now weights each position in a sequence according to the information content of that position in the consensus, does best at ranking promoter strength. However, a derivative of the information content function provides a more efficient search protocol. An information content analysis of the promoter database is presented, which suggested that the different promoter spacing classes (defined by the number of bases separating the -10 and -35 regions) should be maintained as separate databases rather than combined (O’Neill, 1989); this approach was adopted throughout this work.

0022-2836/89/100301-10 $03.00/0
© 1989 Academic Press Limited

2. Methods
A series of computer programs was written in PASCAL to implement: (1) the information content analysis of Schneider et al. (1986) for an input of an aligned family of sequences; (2) the statistical-mechanical analysis of Berg & von Hippel (1987) on any input sequence; (3) a search protocol, suitable for any input sequence, employing the derived information content function and cutoff criteria described in the text; and (4) a search and ranking protocol, suitable for any input sequence, employing the information content-factored Berg-von Hippel function described in the text. In contrast to previous studies of the Hawley & McClure (1983) promoter database, the promoters of the different spacing classes were not combined into a single database but were maintained as separate databases. The database for the 17-base class includes only the bacterial promoters, except in those instances, all concerned with phage promoters, where it is specifically noted that the bacterial and phage promoters of a particular spacing class have been combined. The information content search protocol uses a separate set of criteria for phage promoters, dictated by the phage database alone. In all searches, a mask that begins at a given base and extends over 58 bases defines a sequence that is tested under each of the 3 spacing group protocols for the presence of a promoter; thus, all of the major spacing class alignments are routinely tested on every sequence under examination.
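The windowing step described above can be pictured with a short sketch. The Python below is illustrative only: the original programs were written in PASCAL and are not reproduced here, `scan_windows` is a hypothetical name, and the toy predicates merely stand in for the three spacing-class protocols.

```python
from typing import Callable, Dict, List, Tuple

WINDOW = 58  # mask length used in the text


def scan_windows(
    sequence: str,
    class_tests: Dict[str, Callable[[str], bool]],
    window: int = WINDOW,
) -> List[Tuple[int, str]]:
    """Slide a mask along the sequence and apply every spacing-class test.

    Returns (1-based start position, class name) for each window that a
    class test accepts.
    """
    hits = []
    for start in range(len(sequence) - window + 1):
        segment = sequence[start:start + window]
        for name, test in class_tests.items():
            if test(segment):
                hits.append((start + 1, name))
    return hits


# Toy predicates (assumptions, not the criteria of the paper); a 6-base
# window keeps the example small.
toy_tests = {
    "16": lambda seg: seg.startswith("TTG"),
    "17": lambda seg: seg.endswith("AAT"),
    "18": lambda seg: False,
}
toy_hits = scan_windows("CCTTGACAATGG", toy_tests, window=6)
```

Every window is offered to every class test, which mirrors the statement that all major spacing class alignments are tried on each sequence under examination.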
3. Results and Discussion
Schneider et al. (1986) used an information content index to determine the relative significance of each position within the consensus of a family of aligned sequences. Assuming equal average base frequencies, this index ranks each position in the consensus with a negative entropy value between zero (random) and two bits (total conservation of a single base). Figure 1 shows the information content index for each of the spacing classes (16, 17, 18) of E. coli promoters (Hawley & McClure, 1983). Also shown are the phage promoters of the 17-base spacing class. There are large-scale deviations, both within the -35 and -10 regions and outside these regions, with respect to conservation between spacing classes. Simply put, each spacing class has a distinct consensus sequence. Furthermore, the phage promoters of the 17-base class (the majority) show differences of more than two standard deviations at approximately one-fourth of the positions when compared to the bacterial 17-base class, suggesting that this spacing class should possibly be subdivided. All consensus values used in this work are, therefore, values based on a particular spacing class as opposed to the combined total, with the phage and bacterial promoters combined only where noted. Table 1 shows the resultant frequency table; this Table is based on 18, 21 and 13 promoters, respectively, for the 16, 17 and 18-base spacing classes. These promoters represent all of the fully characterized bacterial promoters from Hawley & McClure (1983) that could unambiguously be assigned to one of the three major spacing groups (all promoters were ultimately tested under all 3 protocols in making the assignment). Also shown are the values for the 18 phage promoters of the 17-base class (the only phage class large enough to generate a consensus) and the values for the combination of bacterial and phage promoters of the 17-base class.
Figure 1. Information content profiles for the separate promoter spacing classes. Information content, as defined by Schneider et al. (1986), is plotted for 58-base sequences corresponding to the consensus of (a) the 18 promoters of the 16-base spacing class; (b) the 21 promoters of the 17-base class; (c) the 18 promoters of the phage 17-base class (17P); and (d) the 13 promoters of the 18-base class of the Hawley & McClure (1983) Fig. 1 listing. Negative values (≥ -0.183), which are due to small sample size, are shown as zero. Error bars indicate ± 1 standard deviation. The broken horizontal line indicates the information content average over 58 positions. [Plot not reproduced; horizontal axis: sequence position.]
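As a rough illustration of how a profile of this kind is obtained from an aligned set, the sketch below computes 2 + ΣP log2 P column by column. It assumes equal base frequencies (so the constant is 2) and omits the small-sample correction of Schneider et al. (1986); the function name and the four toy sequences are hypothetical and are not taken from the promoter database.

```python
import math
from collections import Counter
from typing import List, Sequence


def position_information(aligned: Sequence[str], k: float = 2.0) -> List[float]:
    """Information content (bits) at each column of an aligned sequence set.

    Computes k + sum_b P_b * log2(P_b), where P_b is the observed frequency
    of base b in the column. No small-sample correction is applied.
    """
    n = len(aligned)
    length = len(aligned[0])
    profile = []
    for i in range(length):
        counts = Counter(seq[i] for seq in aligned)
        entropy_term = sum((c / n) * math.log2(c / n) for c in counts.values())
        profile.append(k + entropy_term)
    return profile


# Toy alignment (invented for illustration only).
toy_alignment = ["TTGACA", "TTGACT", "TTTACA", "TAGACC"]
toy_profile = position_information(toy_alignment)
```

A fully conserved column scores 2 bits, while a column in which all four bases are equally frequent would score 0.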
In a formal sense, the information content index employed in Figure 1 is only applicable to a group of sequences. However, it should be possible to develop an analogous function to evaluate how well an individual sequence conforms to the information content pattern of an established consensus. Such a function, as a measure of conformity, might well prove useful in identifying new members of the family represented in the consensus. Schneider et al. (1986) noted that the relative information content at a given position within a specifically aligned set of related sequences in E. coli can be represented as

$$2 + \sum_i P_i \log_2 P_i,$$

where P_i is the frequency of occurrence of a particular base at that position within the set, with the sum taken over the four bases. As noted above, this function can assume values approximately between 0 and 2 (the actual values for small samples are slightly lower), depending on the degree of conservation at a given position. In an attempt to develop an analogous function that would reflect how well an individual sequence mirrored the information profile of the consensus group, one would like to maintain the same central function and the range of values. An analogous function for an individual sequence that can maintain the same general form and yield the same range (0 to 2 for frequencies between 0.25 and 1) is given as:

$$K\left[1 + \frac{(1 - P)\log_2 P}{C}\right].$$
Here K is a constant determined by the average base ratios and the sample size (2 for A = T = G = C and large sample; see Schneider et al. (1986)); C is a normalization constant chosen to give a value of 0 for the function when the base frequency is 0.25; P is the frequency of occurrence, at a given position within the consensus group, of the observed base at the corresponding position in the test sequence. This function, summed over all positions within the sequence, would be used to provide an index of conformity with the consensus profile. An additional problem that must be dealt with is that an individual sequence can have a base that did not occur at all in the consensus group. The lowest frequency possible for any position of a sequence included in the consensus group is 1/N; however, in searching outside the consensus group, bases must be evaluated that occur at zero frequency within the consensus group. In this extreme, the function P ln P is not defined. The way in which frequencies below 0.25 are handled will profoundly affect the net score for the sequence, since the function P ln P can take on very negative values as P approaches 0. A number of possible substitutions for the P = 0 case (e.g. 1/N, 1/(N + 1)) were tested in searches against pBR322 and lambda sequences until one having a statistical justification was found to yield superior performance. If, after Berg & von Hippel (1987), the “Laplace Law of Succession” is used to avoid small sample problems, the base frequency is replaced by the expression [(P + (1/N))/(1 + (4/N))], where N is again the number of sequences in the consensus group.
Thus the function becomes:

$$K\left[1 + \frac{\left(1 - \dfrac{P + (1/N)}{1 + (4/N)}\right)\log_2\left(\dfrac{P + (1/N)}{1 + (4/N)}\right)}{1.5}\right].$$
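The per-position function for an individual sequence can be sketched as follows. This is a minimal reading of the formula, assuming K = 2 and C = 1.5 as stated for equal base frequencies; the function names and the toy frequency table are hypothetical, and the frequency table is represented as one dictionary of base frequencies per position.

```python
import math
from typing import Dict, Sequence


def laplace_frequency(p: float, n: int) -> float:
    """Replace an observed frequency p by (p + 1/n) / (1 + 4/n)."""
    return (p + 1.0 / n) / (1.0 + 4.0 / n)


def conformity_term(p: float, n: int, k: float = 2.0, c: float = 1.5) -> float:
    """K[1 + ((1 - q) log2 q) / C], with q the Laplace-adjusted frequency."""
    q = laplace_frequency(p, n)
    return k * (1.0 + (1.0 - q) * math.log2(q) / c)


def conformity_score(sequence: str,
                     freq_table: Sequence[Dict[str, float]],
                     n: int) -> float:
    """Sum the per-position term over a test sequence.

    A base absent from a column of the table is treated as frequency 0.
    """
    return sum(conformity_term(column.get(base, 0.0), n)
               for base, column in zip(sequence, freq_table))


# Toy table for two positions built from n = 4 sequences (invented values).
toy_table = [{"T": 1.0}, {"T": 0.75, "A": 0.25}]
score_tt = conformity_score("TT", toy_table, n=4)
score_ta = conformity_score("TA", toy_table, n=4)
```

Note that a frequency of 0.25 is left unchanged by the adjustment, so the zero point of the function is preserved, whereas a base never seen in the consensus group receives a finite negative score rather than an undefined one.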
Figure 2 shows the information content profiles for the consensus of the five P1 ribosomal RNA promoters and the three P2 ribosomal RNA promoters. Also shown are the profiles of individual sequences in each group given by the function above. The familial relation is obvious in the patterns, to the point that the sequences could be grouped correctly by pattern alone. The function for individual sequences is clearly giving a qualitative index of the sequence’s conformity with the information content of the consensus. As a more quantitative test of the function’s effectiveness, a promoter search protocol employing it was developed, using the Hawley & McClure (1983) promoter database to provide the defining criteria. Under the protocol, all input sequences are tested with three separate test sets, one for each major spacing group; the database and cutoff criteria for each test are provided by the 18 promoters of the 16-base class, the 21 promoters of the 17-base class, the 13 promoters of the 18-base class and, in certain cases, the 18 promoters of the phage 17-base class combined with those of the bacterial 17-base class (Hawley & McClure, 1983). Each test set: (1) sums the function over 58 positions (position 20 corresponds to the last base of the -35 region); (2) sums over the positions within the -35 and -10 regions that are conserved at a level at least three standard deviations above average for the class; (3) sums over those non -35/-10 positions that are conserved at a level at least a standard deviation above average in a given spacing class; and (4) sums, as a negative indicator, the number of bases having frequencies of 1/N or less in the consensus. Each set consists of six evaluations of these sums in different combinations. Table 2 presents the cutoff criteria determined for each of the spacing classes as well as a separate set developed for the phage promoters of the 17-base class.
Table 3 presents the search results on the promoter database, on random sequence of varying G+C content, on pBR322 (Sutcliffe, 1978), and on lambda (Daniels et al., 1983). The protocol correctly identified all but three of the 52 bacterial promoters. In random sequence of 50% G+C, it found fewer than one site per 1000 bases; in sequence of 20% G+C, the frequency rose to about five sites per 1000 bases. In a search of pBR322, the search missed the cyclic AMP-binding protein-dependent P4 promoter and the weak P1 promoter while finding the other known promoters and only three additional sites. The search of the bacteriophage lambda genome was performed in two ways, due to the uncertainty as to whether phage and bacterial promoters should be pooled. In one trial, the 17-base database was composed only of the 17-base class bacterial promoters; in the other, the 17-base database pooled bacterial and phage promoters. As one might expect, the search employing cutoff
LOC
A. E&-e ,fwquencim I 16 17 18 2 16 17 18 3 16 17 18 4 16 17 18 5 16 17 18 6 16 17 18 7 16 17 18 8 16 17 18 9 16 17 18 10 16 17 IX 11 16 17 18 12 16 17 18 13 16 17 18 14 16 17 18 15 16 17 1x 16 Iti Ii IX
GIJI
(
of the 16. 17 and 0.22 6.17 0.10 0.19 0.23 0.31 0.17 0.22 0.10 0.38 0.15 0.23 0.11 0.33 0.29 0.19 0.23 0.31 0.17 0.50 0.24 0.14 0.46 0.15 0.11 0.0 0.19 0.29 0.23 0.31 0.11 0.06 0.10 0.14 0.08 0.15 0.11 0.22 0.10 0.10 0.08 0.23 0.11 046 0.10 0.05 0.23 0.23 0.17 0.0 0.05 0.29 0.08 0.23 0.11 0.17 0.10 0.10 0.08 0.23 0.0 0.11 0.19 0.14 0.23 0.08 0.22 0.56 0.14 0.24 0.15 0.08 0.44 0.06 0.24 0.10 0.08 0.15 0.11 0.50 0.48 0.38 0.15 0.46 0.11 0.0 0.14 0.0 0.15 0.15 0.06 0.06 0.05 0.10 0.0 om
<:
T
Lot
G
l&base promoter swciw classes 0.89 0.22 17 ‘0.39 0.76 0.33 0.38 0.69 0.15 0.31 0.0 0.28 18 0.33 0.05 0.24 0.29 0.0 0.46 0.15 0.17 0.22 19 0.33 0.05 0.24 0.29 0.0 0.38 0.08 20 0.17 0.17 0.17 0.24 0.33 0.29 0.15 0.08 0.31 21 0.61 0.28 0.28 0.29 0.24 0.10 0.38 0.08 0.0 22 0.67 0.17 0.22 0.52 0.24 0.24 0.69 0.08 0.15 23 0.44 0.44 0.22 0.57 0.24 0.33 0.46 0.0 0.23 0.33 24 0.22 0.50 0.38 0.48 0.24 0.38 0.15 0.15 25 0.27 0.56 0.33 0.33 0.19 0.33 0.15 038 0.31 0.44 26 0.28 0.39 0.52 0.29 0.19 0.54 0.15 0.23 0.72 0.17 27 0.33 0.24 0.29 0.38 0.23 0.46 0.31 28 0.22 0.06 0.17 0.19 0.43 0.29 0.08 0.31 0.46 0.28 29 0.22 0.50 0.19 0.24 0.48 0.31 0.46 0.15 30 0.33 0.06 0.44 0.10 0.05 0.19 0.15 0.23 0.23 0.11 0.0 31 0.89 0.0 0.24 0.86 0.08 0.62 0.0 0.06 0.83 32 0.11 0.05 0.24 0.81 0.1.5 0.08 0.85
A 0.0 0.14 0.08 0.17 0.38 0.15 0.61 0.57 0.54 0.06 0.0 0.31 0.39 0.10 0.31 0.28 0.19 0.15 0.22 0.19 0.38 0.22 0.29 0.15 0.22 0.14 0.15 0.06 0.14 0.08 0.33 0.33 0.23 0.22 0.33 0.23 0.11 0.29 0.23 0.11 0.19 0.08 0.0 0.24 0.23 0.28 0.14 0.23
C” 0.0 0.0 0.0 0.56 0.48 0.54 0.17 0.29 0.15 0.44 0.52 0.23 0.17 0.38 0.23 0.28 0.38 0.31 0.28 0.10 0.08 0.22 0.19 0.15 0.28 0.24 0.31 0.22 0.38 0.23 0.17 0.19 0.23 0.28 0.19 0.15 0.06 0.19 0.15 0.33 0.38 0.54 0.67 0.29 0.54 0.33 0.29 0.23
A
Table 1
0.11 0.10 0.23 0.28 0.10 0.31 0.06 0.10 0.31 0.33 0.24 0.31 0.17 0.43 0.46 0.22 0.19 0.38 0@6 0.38 0.54 0.33 0.29 0.54 0.17 0.43 0.38 0.33 0.29 0.46 0.17 0.24 0.23 0.28 0.19 0.54 0.33 0.29 0.46 0.11 0.24 0.15 0.22 0.24 O-23 0.28 0.33 0.38
T
Base frequencies
48
47
46
45
44
43
42
41
40
39
38
37
36
35
34
33
LCW
0.22 0.05 0.15 0.22 0.29 0.08 0.17 0.29 0.31 0.39 0.14 0.15 0,06 0.29 0.46 0.0 0.14 0.38 0.06 0.0 0.0 0.06 0.19 0.0 0.11 0.19 0.23 0.0 0.19 0.38 0.67 0.0 0.15 0.17 0.24 0.0 0.56 0.19 0.08 0.06 0.33 0.23 0.06 0.29 0.23 0.17 0.24 0.08
Q 0.06 0.29 0.15 0.44 0.14 0.15 0.50 0.24 0.15 0.33 0.24 0.15 0.06 0.24 0.23 0.0 0.14 0.23 0.06 0.05 0.15 0.06 0.14 0.0 0.06 0.14 0.23 0.06 0.24 0.23 0.11 0.0 0.23 0.72 0.14 0.08 0.11 0.33 0.08 0.72 0.38 0.23 0.56 0.24 0.15 044 0.43 0.23
(: 0.11 0.24 0.31 0.17 0.19 0.15 0.22 0.19 0.38 0.17 0.29 0.15 0.0 0.24 0.08 1.0 0.0 0.0 0.22 090 0.08 0.67 0.29 1.0 0.78 0.67 0.08 0.0 0.43 0.31 0.17 0.05 0.31 0.06 0.33 0.0 0.28 0.29 0.38 0.17 0.14 0.08 0.17 0.38 .08 0.17 0.24 0.46
A 0.61 0.43 0.38 0.17 0.38 0.62 0.11 0.29 0.15 0.11 0.33 0.54 0.89 0.24 0.23 0.0 0.71 0.38 0.67 0.05 0.77 0.22 0.38 0.0 0.06 0.0 0.46 0.94 0.14 0.08 0.06 o-95 0.31 0.06 0.29 0.92 0.06 0.19 0.46 0.06 0.14 0.46 0.22 0.10 0.54 0.22 0.10 0.23
T
58
57
56
55
54
53
52
51
50
49
LOC
0.11 0.38 0.38 0.17 0.10 0.23 0.17 0.19 0.15 0.06 0.10 0.0 0.39 0.14 0.08 0.56 0.19 0.08 0.22 0.29 0.31 0.22 0.33 0.15 0.22 0.29 0.23 0.22 0.10 0.08
G 0.72 0.48 0.15 0.50 0.43 0.46 0.33 0.43 0.31 0.33 0.29 0.31 0.28 0.10 0.08 0.22 0.33 0.31 0.22 0.33 0.38 0.50 0.19 0.15 0.17 0.14 0.08 0.39 0.33 0.23
c 0.17 0.10 0.23 0.22 0.24 0.15 0.28 0.19 0.46 0.11 0.38 0.38 0.0 0.43 0.23 0.11 0.29 0.15 0.39 0.14 0.23 0.22 0.14 0.46 0.39 0.19 0.0 0.22 0.24 0.23
A
0.11 0.19 0.46 0.17 0.24 0.08 0.06 0.33 0.23 0.22 0.38 0.69 0.11 0.33 0.46
042
0.0 0.05 0.23 0.11 0.24 0.15 0.22 0.19 0.08 0.50 0.24 0.31 0.33 0.33
T
Figure 2. Information content profiles for ribosomal RNA promoters. The information content profiles of the consensus of the 5 P1 promoters and of the 3 P2 promoters of the Hawley & McClure (1983) listing are shown in the top 2 panels; error bars correspond to ± 1 standard deviation. Below these are shown the profiles of the individual members of the same family; these profiles for individual sequences are produced by the function derived, by analogy with the information content function, for individual sequences. For the purpose of comparisons over the 0 to 2 range, negative values are truncated to zero. [Plot not reproduced. Recoverable panel labels: ribo P2(3), ABP2, DEX P2; horizontal axis: sequence position, 0 to 60.]

criteria generated by the phage database produced a much more specific result; it identified all of the major lambda promoters, except λPRE, with only 12 additional sites in a search of 97,000 bases. The majority of the additional sites come from the highly A+T-rich b region (Collins & Coulson, 1984).

It is perhaps worth digressing at this point to note that, in the absence of a perfect description of the binding site of interest, all searches are perforce ad hoc in nature. Regardless of the function employed, it is always necessary to set cutoff values in an essentially arbitrary fashion. If one sets the cutoff so as to minimize false negatives, i.e. to capture all known sites, the result is generally an unacceptable level of false positives when real biological targets are tested (cf. Mulligan & McClure, 1986). In the work presented here, false negative levels up to one-third were accepted in order to reduce substantially the level of false positives (extending the cutoffs in the preceding search to include pBR322’s P4 promoter increases the number of positives from 3 and 2 to 44 and 66 in the 17-base group).

Whereas the information content formalism of Schneider et al. (1986) was developed to examine a group of sequences and is not here rigorously
Table 2. Criteria for the promoter search protocol
Function
Definition
%JV’,) ~fl(f’,),, ~fflUJi)17 ~f,Cf’i),S EfZ(f)i)lh ~fdf’Jl7 V2V’J1* V3(f’i)lh
Total over 58 Sum positions Sum positions Sum positions Sum positions Sum positions Sum positions Sum posit,ions 46, 49 Sum positions Sum positions .f 1 +f* .fl ff3
~fdf’i),, Xsj(f’i)lS
bases 15, 16. 17 15. 16, 17 16, 17 37. 38, 42 38, 39, 41. 43 39. 40. 44 5, 6, 11. 31. 39, 40. 41, 44,
fl
+.f2
I8
2 !).74
342
Il.19
12.0
2 3.83 2 2.76 2 2.78 2 5.03 14
4.77 0.43 -3.21 1.95 4
4.0 -0.32 3.31 6.35 1
1.18 1.97 3.94 5.49 6
7, 8, 14. 18. 19, 20, 49 6. 18. 19, 23, 30, 52, 56
f* +f3
Zf4(Pi< l/-V)
Cutoff criteria 17 l7P
16
+s3
Total 58 bases
Input sequence is subjected to a series of 6 tests for each of the 3 major spacing classes. Each spacing class has specific criteria for each of the 6 tests. To be scored as a promoter, a region of the sequence must pass all 6 tests for one of the spacing classes. The functions defined as f0 through f3 all sum the information index over those positions indicated in the Definition column. The function f4 is simply a counter of bases whose frequencies are 0 or 1/N in the consensus. The tests employed are, for each 58-base segment: (1) the sum of the index over 58 positions; (2) the sum of the index for the most conserved of the -35 and -10 region positions; (3) the sum of the index over the most conserved -35 positions and the most conserved positions outside the -35 and -10 regions; (4) the sum in (3) with -10 positions substituted for -35 positions; (5) the sum of the index over all positions conserved at a level at least a standard deviation above average (see Fig. 1); (6) a count of the occurrence of bases having 0 or 1/N frequencies in the consensus. For the first 5 tests, the index sum must be greater than or equal to the specified cutoff; in the last test, the count must be less than or equal to the limit specified.
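The six-test decision described in this legend might be organized as in the sketch below. It is an assumption-laden outline rather than the original program: the class name, field names, position sets and cutoff values are all invented for illustration (they are not the values of Table 2), and an 8-position toy segment is used in place of 58 bases.

```python
from dataclasses import dataclass
from typing import FrozenSet, Sequence, Tuple


@dataclass(frozen=True)
class ClassCriteria:
    f1: FrozenSet[int]         # most conserved -35 positions (1-based)
    f2: FrozenSet[int]         # most conserved -10 positions
    f3: FrozenSet[int]         # conserved positions outside both regions
    conserved: FrozenSet[int]  # all positions >= 1 SD above average (test 5)
    min_sums: Tuple[float, float, float, float, float]  # cutoffs, tests 1-5
    max_rare: int              # limit for test 6


def passes_protocol(index: Sequence[float],
                    rare: Sequence[bool],
                    crit: ClassCriteria) -> bool:
    """Return True only if a segment passes all six tests for one class.

    index[i] is the per-position information index of the segment;
    rare[i] is True when the base at position i+1 has frequency 0 or 1/N
    in the consensus.
    """
    def subtotal(positions):
        return sum(index[p - 1] for p in positions)

    sums = (
        sum(index),                    # test 1: all positions
        subtotal(crit.f1 | crit.f2),   # test 2: -35 and -10 positions
        subtotal(crit.f1 | crit.f3),   # test 3: -35 plus outside positions
        subtotal(crit.f2 | crit.f3),   # test 4: -10 plus outside positions
        subtotal(crit.conserved),      # test 5: all conserved positions
    )
    sums_ok = all(s >= cutoff for s, cutoff in zip(sums, crit.min_sums))
    return sums_ok and sum(rare) <= crit.max_rare  # test 6: rare-base count


# Invented toy criteria and segments.
toy_crit = ClassCriteria(
    f1=frozenset({1, 2}), f2=frozenset({5, 6}), f3=frozenset({8}),
    conserved=frozenset({1, 2, 5, 6, 8}),
    min_sums=(4.0, 3.0, 2.0, 2.0, 4.0), max_rare=1,
)
toy_index = [1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0]
accepted = passes_protocol(toy_index, [False] * 8, toy_crit)
rejected = passes_protocol(toy_index, [True, True] + [False] * 6, toy_crit)
```

Requiring all six tests to pass is what makes the protocol conservative: a segment with a good overall sum can still be rejected by the rare-base counter.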
Table 3. Effectiveness of the promoter search protocol on various target sequences

Target sequence                            Number and class of “promoters”   Target co-ordinates
Promoters of the 16-base class (18)        16                                N.A.
Promoters of the 17-base class (21)        20                                N.A.
Promoters of the 18-base class (13)        13                                N.A.
Random (50% G+C) (20,000 bases)            16-17, 1-16                       N.A.
Random (20% G+C) (20,000 bases)            79-17, 31-18                      N.A.
pBR322 (c)                                 3-17                              3650, 4130, 4359
                                           1-18                              4928
pBR322 (cc)                                2-17                              3137, 4238
                                           1-18                              4259
Lambda (r)-B                               23-17                             N.A.
                                           5-18                              N.A.
Lambda (l)-B                               1-16                              N.A.
                                           21-17                             N.A.
                                           1-18                              N.A.
Lambda (r)-P†                              4-17                              23714, 37974, 44538, 47095
                                           5-18                              8916, 23182, 23216, 23335, 35110
Lambda (l)-P                               1-16                              24697
                                           7-17                              21993, 24083, U, 29676, 35631, 37989, 38725
                                           1-18                              24084
The information content index for individual sequences was used in accordance with the criteria of Table 2 to search input sequences for promoters. The input sequences included the promoter databases themselves, random sequence of varying base ratios, and the complete sequences of pBR322 and phage lambda. The underlined co-ordinates indicate the location of known promoters; these are, in order of appearance: tet, RNAI, primer, and bla in pBR322 and Pa, P,', P,, P,, Pa,, PO of lambda. † This search used a combined bacterial and phage database for the 17-base spacing class with the criteria listed for 17P in Table 2; all other searches used bacterial databases and cutoff criteria. (c), clockwise search; (cc), counter-clockwise search.
extended to individual sequences, the approach of Berg and von Hippel was specifically aimed at placing the assessment of individual sequences on a theoretical basis. The discriminator function that they derived is:

$$\ln\left[\frac{P_{max} + 1/N}{P_{obs} + 1/N}\right],$$

in which P_max is the highest base frequency in the consensus at a given position, P_obs is the consensus frequency for the base observed in the test sequence at that position, and N is the number of sequences used to determine the consensus; this function, summed over the sequence, gives an index of conformity. The higher the index is, the lower the conformity of the sequence with the consensus. (It is interesting that this index effectively combines two of the earlier ad hoc functions applied to this problem, the product of the frequency ratios of Harr et al. (1983) and the sum of the logarithms of the frequencies of Staden (1984).) This index applies the same weighting criterion to every position in the sequence. As Berg and von Hippel note, it does not ideally handle possible irrelevant positions within the sequence in that, due to sampling considerations, these will contribute to the index whereas they should not. There is also a possible problem in that all positions share the same maximum value, ln[N + 1], which is approached gradually by way of ln[P_max(N) + 1]; this may not adequately describe a situation in which there are severe point mutations. (These difficulties may be responsible for the fact that this approach does no better than that of Mulligan et al. (1984) in ranking promoters; on the other hand, it is almost the equal of the modified information content function in its promoter search effectiveness (data not shown).) In an attempt to shade the position weighting, I factored the Berg-von Hippel function with the information content of each position, since the information content is the theoretical weight of each position within a consensus sequence. This would, for instance, automatically discard irrelevant (random) positions from the analysis; and it would place a still higher emphasis on highly conserved positions, with a steeper approach to the maximum (worst) value of the index at each position.
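For concreteness, the unweighted discriminator can be written out as below. This is only a sketch of the formula as stated in the text; the function names and the toy frequency table are hypothetical, and a base missing from a column is treated as having consensus frequency 0.

```python
import math
from typing import Dict, Sequence


def bvh_term(p_max: float, p_obs: float, n: int) -> float:
    """ln[(P_max + 1/N) / (P_obs + 1/N)] for one position."""
    return math.log((p_max + 1.0 / n) / (p_obs + 1.0 / n))


def bvh_index(sequence: str,
              freq_table: Sequence[Dict[str, float]],
              n: int) -> float:
    """Sum the per-position term; higher values mean poorer conformity."""
    total = 0.0
    for base, column in zip(sequence, freq_table):
        total += bvh_term(max(column.values()), column.get(base, 0.0), n)
    return total


# Toy two-position table from n = 10 sequences (invented values).
toy_table = [{"T": 0.9, "A": 0.1}, {"A": 0.5, "G": 0.5}]
best = bvh_index("TA", toy_table, n=10)   # consensus base at both positions
worse = bvh_index("AA", toy_table, n=10)  # minority base at position 1
```

The ceiling mentioned in the text follows directly: with P_max = 1 and P_obs = 0 the term equals ln[N + 1].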
The discrimination function for each position in the sequence thus becomes:

$$\left\{\left(K + \sum_i P_i \log_2 P_i\right)\left(\ln\left[\frac{P_{max} + 1/N}{P_{obs} + 1/N}\right]\right)\right\}.$$

Here K is a constant determined by the genomic base ratios and the sample size (Schneider et al., 1986). The summation in the first term is over the four bases for the base frequencies, P, provided by the consensus at that position. P_max is the frequency of the most common base at that position in the consensus; P_obs is the frequency in the consensus of the base observed in the test sequence at that position; and N is the sample size used to form the consensus. This function, summed over 58 positions as described above, was used as the single test in a promoter search program. When run against the same test samples employed above with the information content function, this test produced “hits” on random sequence about 2.5 times more frequently, and on lambda about six times more frequently (data not shown), than did the information content function above. These levels were obtained after the cutoff criteria had been increased in stringency to the point that only 45 of the 52 bacterial promoters were correctly identified by the test. Thus it appears that, by itself, the modified Berg-von Hippel function provides insufficient discrimination to serve as an effective promoter search function. With ancillary tests of the sort used above with the information content function, this performance might well be improved, but such tinkering would appear to be without theoretical justification.

When this function was used in an attempt to rank promoters in strength, as defined in vitro by K_B k_2 values (Mulligan et al., 1984), the results were somewhat more encouraging. The six wild-type promoters from their list belonging to the 17-base class were tested first. In that all of the 17-base spacing class promoters that Mulligan et al. (1984) listed are phage promoters, the database for this class was again comprised of all bacterial and phage promoters of the 17-base class (N = 39). When the

Figure 3. Correlation of the modified Berg-von Hippel index with promoter strength in vitro. Log(K_B k_2) values from Mulligan et al. (1984) are plotted against the modified Berg-von Hippel index described in the text. (a) The values for the 8 wild-type promoters (T7A1, T7A3, λPR, T7D, T7C, λPRM, lac+ and T7A2) of the Mulligan listing are shown. A least-squares fit line is drawn; the correlation coefficient is 0.97. (b) The values for λPR x3, λPRM E37, λPRM E93, λPRM E104 and λPRM up-1 have been added to those for the 8 wild-type promoters; the correlation coefficient is 0.94. [Plot not reproduced; horizontal axis: promoter index.]
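The information-weighted form of the Berg-von Hippel term, as described in the text, can be sketched as follows. The sketch assumes K = 2 (equal base frequencies, large sample) and uses hypothetical function names; each consensus column is a dictionary of base frequencies, with absent bases taken as frequency 0.

```python
import math
from typing import Dict


def information_weight(column: Dict[str, float], k: float = 2.0) -> float:
    """K + sum_i P_i log2 P_i for one consensus column (0 for a random column)."""
    return k + sum(p * math.log2(p) for p in column.values() if p > 0.0)


def weighted_bvh_term(column: Dict[str, float], base: str, n: int,
                      k: float = 2.0) -> float:
    """Information content of the column times ln[(P_max + 1/N)/(P_obs + 1/N)]."""
    p_max = max(column.values())
    p_obs = column.get(base, 0.0)
    return information_weight(column, k) * math.log(
        (p_max + 1.0 / n) / (p_obs + 1.0 / n))


# Invented columns: one random, one fully conserved.
random_column = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
conserved_column = {"T": 1.0}
random_term = weighted_bvh_term(random_column, "G", n=13)
mismatch_term = weighted_bvh_term(conserved_column, "A", n=13)
match_term = weighted_bvh_term(conserved_column, "T", n=13)
```

A random column contributes nothing regardless of the base observed, while a mismatch at a fully conserved column is penalized at twice the unweighted maximum, which reflects the two effects of the weighting noted in the text.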
Table 4. Promoter rankings

Promoter     Class   Index
P22ant       17P     0.42
str          17      0.76
his          18      0.95
iPK          17P     0.98
434PR        17P     1.15
galp"        18      1.17
λPL          17P     1.22
T7A1         17P     1.26
T7A3         17P     1.31
λPR          17P     1.39
rpoA         17      1.39
uvrBP2       18      1.63
uvrBP1       17      1.68
T7D          17P     1.75
rrnABP1      16      1.83
bioA         18      1.86
rrnGP1       16      1.96
P22mnt       17P     1.97
tufB         17      2.08
trpR         18      2.24
rrnABP2      16      2.26
bioB         17      2.28
rrnEP1       16      2.32
λPO          17P     2.33
supBE        17      2.35
rrnGP2       16      2.39
434PRM       17P     2.40
alaS         18      2.45
M1RNA        17      2.47
rrnXP1       16      2.50
T7C          17P     2.51
rrnDP1       16      2.60
lacP1        18      2.73
glnS         17      2.90
araBAD       18      2.99
rrnDEXP2     16      3.02
l&RNA        16      3.03
IWET         17      3.07
ilvGEDA      18      3.36
deoP1        18      3.46
hisJ         18      3.53
fdX          17P     3.83
rplJ         17      3.88
trp          17      3.95
lexA         17      3.97
lpp          18      4.06
λPRM         17P     4.07
S10          16      4.07
fol          17      4.09
tyrT         16      4.25
trpP2        17      4.26
araC         17      4.41
WA           17P     4.54
spot42       17      4.94
uvrBP3       17      5.00
W-R          18      5.02
W-L          17      5.75
spc          17      5.96
thr          17      6.64
ampC         16      6.82
hC1          17      7.13
P22PRM       17P     7.16
recA         16      7.28
λPI          17P     7.34
aroH         16      8.01
λPRE         17P     8.82
tnaA         16      9.52
malK         16      13.05
malEGF       16      14.07
rpoB         16      16.07

The modified Berg-von Hippel index described in the text was summed over 58 positions in each case. Seventy bacterial and phage promoters were ranked by the index sum (lowest value = strongest promoter). The database for the 17-base class combined the bacterial and phage promoters.
logarithm of K_B k_2 is plotted against the promoter index, the correlation coefficient for these promoters was 0.99 and the ranking order was perfect. The last two wild-type promoters on their list were lac+ and T7A2, both of the 18-base class. Since there are only three phage promoters of this class, these were combined with the bacterial promoters of the same class to make the database for determining an index for T7A2. Figure 3(a) shows the result for all eight of the wild-type promoters. The correlation coefficient here is 0.97, whereas the index of Mulligan et al. gives 0.89 and that of Berg and von Hippel gives O+C3 for the same subset. Only one of these promoters, T7A2, is slightly at odds with the ordering suggested by K_B k_2 values. If this group is extended to include all of the remaining 17-base class promoters on the Mulligan list, excepting the weakest member (λPRM116), the correlation coefficient is 0.94 for 13 promoters (Fig. 3(b)). (The λPRM116 promoter is correctly ranked last but is predicted to have much less than the observed strength.) This substantial improvement in correlation does not carry over to the remaining 18-base spacing class promoters of the Mulligan et al. listing; however, this may be because of the narrow sample (almost all of these promoters are lac derivatives) and/or the relatively small (13) database. Although a database with 13 sequences would generally be considered large for a particular DNA binding site, the proper perspective depends on the total number of such sites. Some probability considerations illustrate potential limitations in this method. Suppose that in the complete set of binding sites (N >> 13), a base is represented at a particular position with a 0.25 frequency; what are the chances that, in a subset of 13 sequences, the observed consensus frequency for the base at that position would be 0? The answer, 0.0237 (Dykes et al., 1975), applied to a 58-base sequence, yields a probability of 0.75 that one or more “false” zeros will appear in the consensus frequency table for a sample that small. Since P_obs = 0 carries a strong negative weight, this can result in a disproportionate shift in the index. A possible hedge against this problem would be a second-order application of the “Laplace Law of Succession”, setting P_obs = 1/(N + 4) whenever P_obs is found to be 0 in small samples. This is not a problem for the much larger database of the 17-base class. As additional data on K_B k_2 values become available, it should be possible to judge whether this function is highly effective at ranking promoters. Table 4 presents a listing of the bacterial and phage promoter rankings for each spacing class, generated by the function, together with the index sum for each promoter. The information content function does not generate the same ranking order (data not shown). For the present, it remains somewhat paradoxical that a function that appears to do a very good job of ranking promoters is not very efficient at finding promoters, and vice versa. Although these protocols have been applied here only to the database of E. coli promoters, the procedures are entirely general and can be employed with any family of sequences, DNA,
RNA or peptide, for which there is a database of reasonable size. What constitutes “reasonable size” can best be determined by analyzing the database with the information content test of Schneider et al. (1986), implemented with the statistical tests that the authors provide.
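The small-sample estimate discussed above (a subset of 13 sequences, a true base frequency of 0.25, 58 positions) can be checked with a few lines of arithmetic. The sketch assumes that sequences and positions are independent and tracks a single base per position; the function names are hypothetical.

```python
def prob_false_zero(true_freq: float, sample_size: int) -> float:
    """Chance that a base with the given true frequency is absent from
    every sequence in a sample of the given size."""
    return (1.0 - true_freq) ** sample_size


def prob_any_false_zero(p_zero: float, positions: int) -> float:
    """Chance of at least one such "false" zero across independent positions."""
    return 1.0 - (1.0 - p_zero) ** positions


p_zero = prob_false_zero(0.25, 13)       # roughly 0.024
p_any = prob_any_false_zero(p_zero, 58)  # roughly 0.75
print(f"single position: {p_zero:.4f}; any of 58 positions: {p_any:.2f}")
```

Under these assumptions the results agree with the figures quoted in the text to within rounding. For comparison, the same calculation with a sample of 39 sequences gives a far smaller single-position probability, consistent with the remark that the larger 17-base database is not affected.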
References
Berg, O. G. & von Hippel, P. H. (1987). J. Mol. Biol. 193, 723-750.
Collins, J. F. & Coulson, A. F. W. (1984). Nucl. Acids Res. 12, 181-192.
Daniels, D. L., Schroeder, J. L., Szybalski, W., Sanger, F. & Blattner, F. R. (1983). In Lambda II (Hendrix, R. W., Roberts, J. W., Stahl, F. W. & Weisberg, R. A., eds), pp. 469-517, Cold Spring Harbor Laboratory Press, Cold Spring Harbor, NY.
Dykes, G., Bambara, R., Marians, K. & Wu, R. (1975). Nucl. Acids Res. 2, 327-337.
Harr, R., Haggstrom, M. & Gustafsson, P. (1983). Nucl. Acids Res. 11, 2943-2957.
Hawley, D. K. & McClure, W. R. (1983). Nucl. Acids Res. 11, 2237-2255.

Edited by P. von Hippel