A Procedure for Identifying Loosely Conserved Protein-Binding DNA Sequences

A Procedure for Identifying Loosely Conserved Protein-Binding DNA Sequences

[20] finding degenerate protein-binding sites in DNA 229 [20] A Procedure for Identifying Loosely Conserved Protein-Binding DNA Sequences By Michae...

91KB Sizes 1 Downloads 72 Views

[20]

finding degenerate protein-binding sites in DNA

229

[20] A Procedure for Identifying Loosely Conserved Protein-Binding DNA Sequences By Michael C. O’Neill There are different types of searches for protein-binding DNA sequences. In some cases, one knows only a region of sequence that contains somewhere within it the desired binding site of an unknown character, e.g., most eukaryotic promoters. Given several such regions, it may be possible to determine the sequence using a search method that looks for improbable levels of local complexity in the DNA. The more common type of search is typical of a more advanced stage of knowledge in which case one has several examples of known binding sites for a particular protein and wishes to find other members of that sequence family. This latter case is discussed here. Searching by Homology

The starting point for this search procedure is, therefore, an aligned set of six or more examples of a particular sequence motif. What information can be drawn from this collection? With the sequences aligned, it is simple to determine the consensus base at each position, allowing an immediate homology search with the consensus sequence. If the binding sequences are tightly conserved, say like the lambda operator set, this may be sufficient. (Indeed, if there is only one example of the desired sequence, one is limited to some type of homology search.) However, given that much of the regulatory range of a given binding protein is supplied by deviations in the binding sequence away from the optimum sequence, homology searches often bring inadequate information to the search task; the result is that criteria allowing the capture of the known sites result in very high false-positive levels. Motif Redundancy Measured by the Reduction in Shannon Entropy

What other information does our sequence collection provide? It gives us the distribution of bases at each position within the motif. A consensus sequence search cannot be an optimal search because it has discarded this information. Staden1 and Mulligan et al.2 were among the first to employ 1 2

R. Staden, Nucleic Acids Res. 12, 505 (1984). M. E. Mulligan, D. K. Hawley, R. Entriken, and W. R. McClure, Nucleic Acids Res. 12, 789 (1984).

METHODS IN ENZYMOLOGY, VOL. 370

Copyright 2003, Elsevier Inc. All rights reserved. 0076-6879/03 $35.00

230

analyses of promoter and transcription patterns

[20]

this information in a distribution-weighted search using base frequencies at each position. A decade earlier, Gatlin3 had suggested that some of the work that Shannon had done on string transmission in noisy communication channels might well be applied to the strings of the genetic code. This suggestion, made before any DNA regulatory sequence was available, lay dormant until the work of Schneider et al.4 in which the authors applied Shannon’s entropy function to the characterization of binding sequence families. The function that they proposed was RI ¼ Sbg Spmotif

(1)

which is the difference between the Shannon entropy of the background and the Shannon entropy of a given position within the motif, to be calculated at each position within the motif. Here the observed frequency of a particular base at a given position of the current sequence collection is taken as the probability, pb, that the base will occur at that position in members of this family. For a running example, a family of six three base sequences will be employed: AGT AAT GGT AGC CAT AGA for equiprobable bases, A¼C¼G¼T in the genome Sbg Sp ¼ ð4  0:25ln2 0:25Þ      pa ln 2 pa þ pc ln 2 pc þ pg ln 2 pg þ pt ln 2 pt ¼ 2þ

X

pb ln2 pb

(2) (3)

b

where pb is the frequency of the base in question drawn from the sequence collection at the position being calculated. position 1 ¼ 2 þ 0:67ð0:578Þ þ 0:167ð2:59Þ þ 0:167ð2:59Þ þ 0ð0Þ (4) ¼ 0:732 bits 3

(5)

L. L. Gatlin, ‘‘Information Theory and the Living System,’’ Columbia Univ. Press, New York, 1972. 4 T. D. Schneider, G. D. Stormo, L. Gold, and A. Ehrenfeucht, J. Mol. Biol. 188, 415 (1986).

[20]

finding degenerate protein-binding sites in DNA

position 2 ¼ 2 þ 0:33ð1:6Þ þ 0:67ð0:578Þ þ 0ð0Þþ0ð0Þ ¼ 1:068bits

231 (6) (7)

position 3 ¼ 2 þ 0:167ð2:59Þ þ 0:167ð2:59Þ þ 0ð0Þþ0:67ð0:578Þ ¼ 0:732bits

(8) (9)

for skewed base ratios " Sbg Sp ¼

X

# pb ln2 ð pb =fb Þ

(10)

b

where fb is the genomic base ratio for the base in question. This function, Eqs. (1) and (10), represents the loss in Shannon entropy moving from background DNA to the fixation of the motif as represented in the current sequence collection. This function is determined for each position and assumes positional independence (linearity) for these regulatory regions. It provides a particular metric for averaging the base information at each position of the motif. Schneider et al.4 called it information content, but as it is the difference of two information contents, it is better named an index of redundancy (RI). Because sequence conservation requires the constant application of energy for maintenance, redundancy can reasonably be equated with importance. Thus the redundancy index gives us the relative importance of each position across the motif. Schneider et al.4 also suggested that the sum of the RI across the whole motif provided an estimate of the frequency of occurrence of motif members within a target sequence, taken as random:   (11) Freq ¼ ðtarget sizeÞ= 2 RI for our example ¼ ðtarget sizeÞ=22:53 ¼ ðtarget sizeÞ=5:8

(12)

What can be learned from this? Figure 1 shows the RI profile of a collection of 26 CRP-binding sequences from Escherichia coli. Because E. coli has equiprobable base ratios, the maximum RI value is 2 (this maximum is somewhat reduced for small sample size) representing total conservation at a position, and the lowest value is 0, representing background base frequencies at that position. We see that (a) total conservation is not seen at any position, (b) there is substantial variation in the relative importance of positions within the motif, (c) some internal positions are random, and (d) despite being an inverted repeat sequence, the symmetry within the collection is, like many other inverted repeats, quite unbalanced with a strong

232

analyses of promoter and transcription patterns

[20]

Redundancy index

2

1

Position in motif Fig. 1. Redundancy index profile. The redundancy index is plotted for the 22 positions of the CRP motif based on the base frequencies of 26 CRP sites in E. coli.

half and a weak half. Does this bring us any closer to the goal of being able to search for motif members? It does not appear to since it is only the sequence group which is characterized; there is no link here between the ideal motif profile and how well individual sequences of the family fit that ideal. Functional Ranking of Individual Sequences

Various functions have been proposed to fill this gap and thus provide a rank ordering of the family members. If one can do a rank ordering, the only additional requirement to do a search is a cutoff value for acceptance. Berg and von Hippel5 first proposed the function    (13) Index ¼ log popt þ 1=N =ð pobs þ 1=N Þ where popt is the consensus base in the current collection at the position being calculated, pobs is the frequency at which the base found in the query sequence is found in the equivalent position of the current collection, and N is the number of examples in the current collection. This value would be determined independently at each position and summed over the whole sequence, the lower the value the better. This index discards information about at least two base frequencies at each position and so cannot be optimal. It is a measure of how well the position is filled without reference to how important the position is within the motif, a shortcoming duly noted 5

O. G. Berg and P. H. von Hippel, J. Mol. Biol. 193, 723 (1987).

[20]

finding degenerate protein-binding sites in DNA

233

by the authors. Consider two frequency distributions for a position: A¼0.5 C¼0.45 G¼0.05 T¼0 and A¼0.3 C¼0.27 G¼0.21 T¼0.22 with N very large in both cases. The sequence being evaluated starts with C. In the first case, we have log[0.5/0.45] or log[1.11]; in the second case we have log[0.3/0.27] or log[1.11]. In both cases the Berg von Hippel function gives the same answer for a C in the query sequence. However, in the first collection, C is one of a highly conserved pair of base choices for this position and in the second collection C is a member of an almost random distribution at this position. Clearly C should not carry the same weight in the two instances. If there were a way to factor in the positional importance, this situation might be remedied. Another function that has been proposed by various investigators at various times,1,6–8 but each time without any proof of its efficacy, is Index ¼ ln 2 ð pb =fb Þ (14) where pb and fb are as defined earlier. As before, this function would be determined for each position of the query sequence and summed over the whole sequence. This index discards base frequency information for the other three bases at each position; it cannot, therefore, be optimal. In E. coli, a background choice would be 0; a totally conserved base at a position would score a 2. However, the function blows up for any base choice that is unrepresented in the current collection; while this tends to keep the false-positive level low, it has the effect that no ‘‘new’’ variants can ever be found. Again consider two different frequency distributions corresponding to different sequence collections: A¼0.60 C¼0.40 and A¼0.20 C¼0.40 G¼0.20 T¼0.20. The base being evaluated is C. It gets the same ranking value in each case, but in the first case it is one of two bases in a conserved position that has an RI of 1.03 bits and in the second case it is one of four bases in a nearly randomized position having an RI of 0.08 bits. If tested on the same data sets used by Berg and von Hippel,5 this index underperforms the Berg von Hippel index (unpublished data9). None of this is encouraging in our effort to develop an accurate search procedure. However, the hint of the way out was given earlier. To achieve an accurate ranking, one must provide two pieces of information for each position of the sequence. The first is how well is the position filled with respect to the motif distribution and the second is how important is that position within the motif? One might think that these are one and the same, 6

P. Bucher, J. Mol. Biol. 212, 563 (1987). G. D. Stormo, Methods Enzymol. 208, 458 (1991). 8 T. D. Schneider, J. Theor. Biol. 201, 87 (1999) 9 M. C. O’Neill, unpublished data. 7

234

analyses of promoter and transcription patterns

[20]

but the frequency examples given earlier show that this is not true. The consensus base at a highly conserved position should receive a high weighting, whereas the consensus base at a randomized (RI near 0) position should not. The Berg von Hippel function, Eq. (13), lacked a sense of positional importance; however, that is precisely what RI purports to measure. Because each function is evaluated independently at each position, it is feasible to factor them. As was shown in earlier work,10 the product (RI*BvH) summed across the sequence is a very effective ranking function that, when combined with a cutoff value, can also be used as a free-standing search function. Let us return to the six-sequence example, supposing our query sequence to be CAT. RI*BvH for Position 1 ¼ 0:732  log½ð0:67 þ 0:166Þ=ð0:166 þ 0:166Þ ¼ 0:29

(15)

Position 2 ¼ 1:07  log½ð0:67 þ 0:166Þ=ð0:33 þ 0:166Þ ¼ 0:24

(16)

Position 3 ¼ 0:732  log½ð0:67 þ 0:166Þ=ð0:67 þ 0:166Þ ¼ 0:0

(17)

total ¼ 0:29 þ 0:24 þ 0 ¼ 0:53

(18)

Similar considerations led to a second ranking function that performs about as well as the RI*BvH function. If RI is determined for a given position of the sequence collection, it can be designated RI. If the sequence to be evaluated is then added to the sequence collection and RI is redetermined, an additional A,C,G, or T has been added at that position. The new RI can be designated RIþ. The difference RIþ -RI is a measure of how well the position has been filled in the query sequence. If the new base is better than the average base at that position in the sequence collection, the difference will be positive; if it is worse than average, the difference will be negative. Again the measure of positional importance is just RI, the value for the current collection. This suggests ðRIRIÞ ¼ RI ðRIþ RI Þ

(19)

as a ranking function to be determined at each position of the query sequence and summed across the whole sequence. Its effectiveness has earlier been shown to be very similar to the RI*BvH function.11 Again, if we work through our example with the query sequence CAT, RIRI for position 1 ¼ 0:732  ð2 þ 0:57  ln2 0:57 þ 0:29  ln2 0:29 þ0:14  ln2 0:14 þ 0:00:732Þ ¼ 0:08 10 11

M. C. O’Neill, J. Mol. Biol. 207, 301 (1989). M. C. O’Neill, Proc. Natl. Acad. Sci. USA 95, 10710 (1998).

(20) (21)

[20]

finding degenerate protein-binding sites in DNA

235

2 ¼ 1:07ð2 þ 0:43  ln2 0:43 þ 0:0 þ 0:57  ln2 0:57 þ 0:01:07Þ

(22)

¼ 0:05 3 ¼ 0:732  ð2 þ 0:14  ln2 0:14 þ 0:14  ln 2 0:14 þ0:0 þ 0:71  ln2 0:710:732Þ

(23)

¼ 0:09 total ¼ 0:04

(24) (25) (26)

Both of the aforementioned functions, Eqs. (14) and (19), leave room for improvement when used as search functions in that the false-positive level can be 0.1% (i.e., percentage of total bases searched). Sequence Analysis with Neural Networks

A very different approach to sequence identification is provided by the use of backpropagation neural networks12,13 to find additional members of an existing sequence collection. In the simplest form, one has merely to code the sequence into binary form, e.g., A¼0001 C¼0010 G¼0100 T¼1000. The network is then trained by supplying coded examples of the sequence type sought along with the desired answer, 1, for yes this is a family member or 0, for this is not a member. Random sequence and double-down mutant sequences can be used for negative inputs. During training the network will learn to distinquish the two classes of input sequences. If the training sample has been sufficiently representative, the network can then be used to evaluate a sequence it has not seen before to find other members of the family. If one wished to search the E. coli genome, the coding program would set a window equal to the length of the query sequence to code that length at the beginning of the genome sequence as the first test example; the program would then shift forward one base to code the second test example and so on until the genome sequence had been exhausted (note that when coding genomes the file size limit of 32bit processors can be exceeded easily). It has been shown earlier that this can be a highly effective search procedure.14,15 In the case of a very difficult sequence family, such as the  70 promoters of E. coli, it was estimated to find about 85% of true promoters with a false-positive level on the order of 0.1%. 12

P. J. Werbos, in ‘‘The Roots of Backpropagation.’’ Wiley, New York, 1994. D. E. Rumelhart, G. E. Hinton, and R. J. Williams, in ‘‘Parallel Distributed Processing: Explorations in the Microstructures of Cognition,’’ p. 318. MIT Press, Cambridge, MA, 1986. 14 M. C. O’Neill, Nucleic Acids Res. 19, 313 (1991). 15 M. C. O’Neill, Nucleic Acids Res. 20, 3471 (1992). 13

236

analyses of promoter and transcription patterns

[20]

Ranking Function Preprocessing for Neural Networks

Inasmuch as the use of a ranking function and the use of neural networks are complementary, it seemed reasonable to preprocess the neural net input by providing the ranking function information at each position of the input sequence rather than the raw base together with an evaluation of the whole sequence. Caution is advised when preprocessing data for a neural net because such processing may lead to the loss of information. In this case, additional information is supplied on the fitness of the base relative to the motif ideal together with an overall evaluation of the sequence. Returning to our example, let us see how our query sequence, CAT, might be coded using the RI*BvH function and the six sequence collection. The worst value possible for a position given our sequence collection would be a C or T at Position 2 ¼ 1:07  log½ð5=6Þ=ð1=6Þ ¼ 1:07  0:699 ¼ 0:75:

(27)

The worst possible sequence total would be 1.77, the best being 0.0. If 6 bits (input positions) are devoted to a qualitative scoring of the position value and 4 bits to the sequence total, each three base sequence could be coded in 3 * 6þ4 or 22 bits. Our code could be < ¼ 0.05 ¼ 000001, >0.05< ¼ 0.1 ¼ 000010, > 0.1< ¼ 0.2 ¼ 000100, >0.2< ¼ 0.3 ¼ 001000, >0.3< ¼ 0.5 ¼ 010000, >0.5 ¼ 100000 for positions and <0.5 ¼ 0001, >0.5< ¼ 1.0 ¼ 0010, >1.0< ¼ 1.5 ¼ 0100, >1.5 ¼ 1000 for sequence total. The input for the query sequence would then be 0010000010000000010010, corresponding to 0.29, 0.24, 0.0, and 0.53. All other sequences would be coded accordingly. The RIRI function could be used in the same fashion. This preprocessing step had the effect of reducing the false-positive level by fourfold or more, dropping it to 0.025% or less.11 This level can be reduced even further if one has some other limiting information about the query sequence. For example, if one is looking for a transcriptional factor-binding site in E. coli, one might be willing to add the condition that it is expected to lie with 300 bases of the start point of translation. Assuming 4300 promoters, 300 * 4300/2 * 4600000 suggests that only 1/7 of the genome satisfies this condition, for a false-positive level approaching 1/7*0.025% or 0.003%. If one were doing a genomic search for promoters, this would still yield 1 false positive for each 15 true promoters (see files p16, p17, p18 at http://research.umbc.edu/ moneill for the results of such a search). This procedure provides a general search method that is powerful enough to be of real utility in the case of prokaryotic genomes or smaller searches. Unfortunately, protein-binding sequences in eukaryotes do not

[21]

237

dragon promoter finder

appear to be distinctly larger or more diverse than those of prokaryotes, yet the genome (¼total search space) may be more than a thousand times larger. If the cell were considered as a bag with binding proteins diffusing freely within it, there would not appear to be sufficient information for these proteins to find their specific targets. This paradox would be avoided if the proteins always bound as part of a larger complex or if the proteins were concentrated in specific subdomains, making the cellular concentrations relatively irrelevant. Ending on a practical note, the software for every type of search described here—homology, up to 6 million bases under Windows and alloted memory under Unix with helix search formatting, specific limit on base errors, zero-weighting specific positions, and indeterminate homology search size among other features, RI characterization of sequence families, RIRI or RI*BvH rank ordering of sequence families or sequence searches with a user-set cutoff, neural network searches using base sequences, and neural network searches using ranking function preprocessing—are available, with advice, from the author ([email protected]).

[21] Computational Detection of Vertebrate RNA Polymerase II Promoters By Vladimir B. Bajic and Vladimir Brusic The sequence of the human genome1,2 and sequences of other vertebrate genomes (including mouse, rat, fugu fish, zebra fish) represent raw material for the identification and annotation of complete lists of their genes. The basic, and largely incomplete, annotations are available at the major repositories, such as NCBI3 or Ensembl.4 Gene annotations should provide information on coding, noncoding, and regulatory regions of genes, as well as their regulatory elements. A promoter region is an integral part of a gene that is responsible for the initiation and partial regulation of the gene transcription process.5 Each eukaryotic gene must have at least one 1

E. S. Lander et al., Nature 409, 860 (2001). J. C. Venter et al., Science 291, 1304 (2001). 3 M. Feolo, W. Helmberg, S. Sherry, and D. R. Maglott, Rev. Immunogenet. 2, 461 (2000). 4 T. Hubbard et al., Nucleic Acids Res. 30, 38 (2002). 5 R. O. J. Weinzierl, ‘‘Mechanism of Gene Expression,’’ Imperial College Press, London, 1999. 2

METHODS IN ENZYMOLOGY, VOL. 370

Copyright 2003, Elsevier Inc. All rights reserved. 0076-6879/03 $35.00