journal of biotechnology ELSEVIER
Journal of Biotechnology 35 (1994) 257-272
Comparing nucleotide and protein sequences by linguistic methods
Shmuel Pietrokovski
Department of Structural Biology, The Weizmann Institute of Science, Rehovot, Israel
Abstract
Nucleotide and amino acid sequences can be analyzed and compared by their oligomer compositions. Such methods are fundamentally different from comparison methods based on sequence alignment. They are analogous to the linguistic analysis of human texts. The methods have a wide range of sensitivity and can identify homologous as well as functionally and taxonomically related sequences. Significant sequence dissimilarity can also be identified enabling detection of foreign DNA sequences in genomes, genetic libraries and databases. The simplicity and speed of linguistic methods make them very suitable for database searching and maintenance and as a preliminary step to more specific and time-consuming analysis methods.
Key words: Sequence linguistics; Sequence analysis algorithm; Sequence comparison; Sequence alignment; Molecular sequence data; Sequence database; Amino acid sequence; Nucleotide sequence; Oligopeptide; Oligonucleotide
1. Introduction
Nucleic acid and protein molecules determine and modify the structure and function of living cells and organisms. The blueprints and manufacturing instructions for the cellular machinery and structural components are encoded by molecules of nucleic acids. Protein molecules are that machinery and those components. Both types of molecules are linear chains made of specific elements, four nucleotides for nucleic acid molecules (DNA or RNA) and 20 amino acids for proteins. The nature of the molecules is prescribed mainly, or solely, by their sequence. Large amounts of these biological sequences can now be determined due to the advances in molecular biology, biochemistry and genetics in the last two decades. With the accelerating amounts of determined sequence data outstripping our capacity to analyze the data, the science of biology is at the threshold of a significant change. The emphasis is turning from data gathering to data analysis. Until a few years ago some phenotype was required in order to isolate the molecular feature (genotype - the gene and its product) controlling it. Today, many nucleotide sequences and the proteins they encode are found with no prior or only partial knowledge of their function (Daniels et al., 1992; Martin-Gallardo et al., 1992; Oliver et al., 1992; Sulston et al., 1992). Currently we have only rudimentary knowledge of the way the primary sequence of nucleic acid and protein
0168-1656/94/$07.00 © 1994 Elsevier Science B.V. All rights reserved SSDI 0168-1656(94)00044-D
molecules determines their ultimate function. One most often relies, directly and indirectly, on the already known sequences with identified functions to resolve the function of new anonymous sequences. Indirectly, sequence characteristics are defined from known sequences (usually by looking for common patterns) and those characteristics are searched for in newly determined sequences. Directly, known sequences are searched for matches similar to a new sequence. Sequence alignment is the customary way to check for sequence similarity. There are a number of well-defined algorithms for sequence alignment and the method has been mathematically and statistically studied. The method is suitable for identifying homologous sequences, that is, sequences which arose from a common progenitor. However, weak similarities between much diverged homologous sequences are difficult to recognize by alignment. The algorithms which guarantee optimal alignments are time-consuming and thus are not suitable for comparing long sequences and searching databases.
Sequence linguistics is the analysis, characterization and comparison of biological sequences by their oligomer composition. Linguistic analysis can identify basic 'words' in nucleotide and amino acid sequences. The sensitivity range of linguistic analysis is wider than that of alignment techniques, encompassing weak and strong similarities of homologous sequences as well as functional and taxonomic relationships. Linguistic analysis can provide an economical way to analyze long sequences and large amounts of sequence data. The study of biological sequences by linguistic methods is analogous to the conventional ways by which human texts are analyzed and compared. These include methods which employ frequencies of letter combinations (Bennett, 1976), relative frequencies of key words (Salton, 1991), and word-occurrence patterns and sentence structures (Morton, 1978).
In the following, the relation between comparing sequences by linguistic methods and by alignment techniques and some of the features of linguistic methods will be briefly described. Special attention will be given to the linguistic approach developed in our group for sequence comparison. Examples of the practical power of sequence linguistics will be demonstrated by results obtained by our approach.
2. Comparing sequences by linguistic and by alignment techniques
An alignment of two or more sequences tries to maximize the identities between the sequences by 'sliding' them one over the other and introducing gaps, that is, by trying to find the minimal/simplest way to transform one sequence into the other(s) by using substitutions and deletions of its elements. It is possible to find the optimal alignment along the entire length of the sequences (Needleman and Wunsch, 1970) or an optimal alignment covering only part of the sequences (Smith and Waterman, 1981). The computation times for these optimal alignment algorithms are proportional to the product of the lengths of the aligned sequences (Waterman, 1984). There are a few related alignment algorithms which are much faster but do not guarantee optimal alignments (WORDSEARCH - Devereux et al., 1984; FASTA - Pearson and Lipman, 1988; BLAST - Altschul et al., 1990). In order to align sequences it is necessary to have a matrix specifying the individual scores of similarity between each and every sequence element and a gap penalty that specifies how much a gap detracts from the total score of the identities. It is generally not clear which parameter values should be used. The matrix of element identities should be taken according to the relationship being examined. When searching for evolutionary relationships an exchange frequency matrix (mutation matrix), such as the Dayhoff matrix, should be taken. When looking for structural similarities a matrix based on the physical similarities (size, hydrophobicity, charge, etc.) between the elements is normally used (George et al., 1990). The value for the gap penalty parameter is known not to be constant along a sequence (Bowie et al., 1991) (and if so then it might not be constant in different sequences). More generally, while this parameter value is critical for the alignment (von Heijne, 1988), currently there is no
methodological way to choose it (Waterman, 1984; Blaisdell, 1989; Argos and Vingron, 1990; Karlin et al., 1991).
Linguistic analysis does not examine a sequence as a whole but by its oligomer makeup. In the method developed in our group (Brendel et al., 1986; Pietrokovski et al., 1990) the attribute of each oligomer is the difference between its observed occurrence in the sequence and its expected occurrence. This difference is termed the oligomer's contrast value, and the contrast values of all the oligomers make up the contrast vocabulary of the sequence. An oligomer with a contrast value close to zero occurs as expected, a large positive contrast value shows that the oligomer is over-represented and a large negative value shows that it is under-represented. To compare two sequences we compare their contrast vocabularies by calculating the correlation coefficient between the vocabularies. The time for linguistically comparing sequences is proportional to the sum of the sequence lengths. Unlike sequence alignments, where a type of 'dot plot' matrix (Goad, 1986) has to be constructed and analyzed, in linguistic analysis the composition of each sequence is calculated separately and then these compositions are compared. The useful, informative range of linguistic techniques is broad and they can identify sequence relations not found by alignment techniques (Blaisdell, 1986; Pietrokovski et al., 1990). In addition to identifying homologous sequences it is possible to identify sequences with common function and taxonomic origin by their linguistic characteristics. Protein and nucleotide sequences were classified in families by neural networks using the composition of the sequences (Ferran et al., 1993; Wu et al., 1993). An efficient approach for locating protein coding regions by using multiple linguistic characteristics of nucleotide sequences was recently developed (Uberbacher and Mural, 1991).
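The comparison just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the cited work: the function names, the DNA alphabet constant and the example sequences are hypothetical, and the expected occurrence of a word is assumed to be estimated from its two overlapping shorter words and their shared core, a commonly used estimator.

```python
from collections import Counter
from itertools import product
from math import sqrt

ALPHABET = "ACGT"  # assumption: unambiguous, upper-case DNA


def count_words(seq, n):
    """Count every overlapping oligomer of length n in seq."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


def contrast_vocabulary(seq, n):
    """Observed minus expected occurrence for every oligomer of length n >= 2.

    ASSUMPTION: expected(w) = count(w[:-1]) * count(w[1:]) / count(w[1:-1]).
    For n == 2 the core is empty and the sequence length is used instead.
    """
    if n < 2:
        raise ValueError("n must be at least 2")
    observed = count_words(seq, n)
    shorter = count_words(seq, n - 1)
    core = count_words(seq, n - 2) if n > 2 else {"": len(seq)}
    vocabulary = {}
    for letters in product(ALPHABET, repeat=n):
        word = "".join(letters)
        core_count = core.get(word[1:-1], 0)
        if core_count:
            expected = shorter[word[:-1]] * shorter[word[1:]] / core_count
        else:
            expected = 0.0
        vocabulary[word] = observed[word] - expected
    return vocabulary


def correlation(vocab_a, vocab_b):
    """Pearson correlation coefficient between two contrast vocabularies."""
    words = sorted(vocab_a)
    xs = [vocab_a[w] for w in words]
    ys = [vocab_b[w] for w in words]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat vocabulary carries no signal
    return cov / sqrt(var_x * var_y)


# Hypothetical toy sequences; real comparisons use much longer ones.
seq_a = "ACGTTGCAAGCTTAGGCTAACGTAGCTAGGATCCA" * 3
seq_b = "TTGACCAGTTAGGCATCGATCGGATTACGCTAGCA" * 3
score = correlation(contrast_vocabulary(seq_a, 3), contrast_vocabulary(seq_b, 3))
print(f"trimer correlation: {score:.3f}")
```

Each vocabulary is built in a single pass over its own sequence, and the final comparison touches only the fixed-size vocabularies, which is why the cost grows with the sum of the sequence lengths rather than with their product.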
In addition to identifying similar sequences, linguistic methods can locate dissimilarity between sequences (Russel and Subak-Sharpe, 1977; Blaisdell, 1986; Pietrokovski and Trifonov, 1992). This is very useful for identifying foreign nucleotide sequences that transposed, or otherwise horizontally moved, between different genomes and organisms or are misidentified (White et al., 1993). Examples of those capabilities will be presented below.
3. The common and the particular of various linguistic techniques
All methods which analyze sequences by their oligomer composition are termed by us linguistic techniques. The earliest of these date from the 1960s and 70s when sequencing techniques were yet unknown, and later very tedious, but compositions of nucleic acid and protein molecules could be determined by biochemical methods. Thus, base composition (Lee et al., 1956), nearest neighbor frequencies (corresponding to dinucleotides) (Subak-Sharpe et al., 1966), RNase-generated fragments (Woese et al., 1976) and DNase site frequencies (Bernardi et al., 1973) were suggested and used to characterize and compare nucleotide sequences. Amino acid sequences were similarly examined (amino acid composition - Cornish-Bowden, 1979; Nishikawa et al., 1983; dipeptide composition - Williams et al., 1961; Gibbs et al., 1971; tryptic fragments - Sorm et al., 1957). When nucleotide and amino acid sequences became available in sufficient quantities in the 1980s, computational techniques were developed for analyzing sequences by their oligomer composition (Claverie and Bougueleret, 1986; Brendel et al., 1986; Blaisdell, 1986, 1989; Volinia et al., 1988; Pietrokovski et al., 1990; van Heel, 1991; Karlin and Brendel, 1992; Solovyev and Makarova, 1993). These techniques share many basic concepts and mainly differ in their specific details. Apart from relying on oligomer occurrences (composition) or their deviations to analyze and compare sequences, the significance of an oligomer can also be examined by the deviation of its observed from its expected positions in a sequence. An oligomer, or class of oligomers, can be distributed randomly, clustered, overdispersed or excessively even. Statistical analyses of marker spacings were introduced to detect and quantify anomalous distributions of oligomers in amino
acid and nucleotide sequences (Karlin and Macken, 1991; Karlin et al., 1992; Karlin and Brendel, 1992).
A basic classification of linguistic techniques is into those that represent sequences by the observed occurrences (frequencies) of their oligomers and those that represent the sequences by the deviations between the observed and the expected occurrences of their oligomers. Linguistic techniques using observed values were suggested for nucleotide sequences (Blaisdell, 1989) and for amino acid sequences (for example: Gibbs et al., 1971; Cornish-Bowden, 1979; van Heel, 1991) and are applied in sequence database maintenance (Bairoch, personal communication). A comparison of the observed triplet frequencies of amino acid sequence pairs was used to estimate the percentage of identity of the sequences (Jones et al., 1992). While the calculation of observed values is of course faster and simpler than calculating the deviation between the observed and expected values, it has the following major drawback - the observed values of oligomers are not informative by themselves. In order to know whether an oligomer occurs unusually in a sequence (and hence might have some significance) it is necessary to compare the number of times it occurs with the number of times it is expected to occur. Moreover, it is not possible to determine from the observed value of an oligomer which of the following has an unusual occurrence - the oligomer itself, some of the shorter oligomers composing it or some longer oligomer it is part of. A sequence rich in leucine tracts will contain a high frequency of leucines; a low frequency of the dinucleotide CpG (5' cytosine-guanosine) will cause a scarceness of all oligonucleotides containing CpG.
The linguistic techniques which utilize the deviations between observed and expected occurrences essentially differ in their determination of the expected values, in their depiction of the deviation between the observed and expected occurrences (frequencies) and in their methods for comparing these lists of oligomer deviations (vocabularies) for different sequences.
The expected occurrences of oligomers in a sequence can be determined from the composition of that sequence, by the composition of other sequence(s) or by a-priori assumptions. The sequence(s) whose composition is used to calculate the expected occurrences must be long enough to give proper sampling of the composition and should be unbiased to give 'true', representative, expected occurrences. The first requirement is usually relevant when the sequence itself is used to determine the expected occurrences of oligomers in it. In such cases, a minimal length of 2000 bases was found by us to be a convenient compromise for calculating the expected occurrences of di- to pentanucleotides (Pietrokovski et al., 1990). Calculating the expected occurrences of long oligomers naturally requires longer sequences than the calculations for shorter oligomers. In particular, the second provision should be taken into account when the expected occurrences are determined using large amounts of sequence data from sequence databanks. Currently, the sequence composition in databanks is biased by being a very small fraction of entire genomes and by over-representing sequences of specific functions, classes and species. Taking the local approach and calculating the expected occurrences in a sequence from its own composition, one examines the difference between the actual observed occurrences and the occurrences expected to appear at random. In the global approach the oligomers' actual occurrence is compared to their actual occurrence in the representative sequence(s) or to their expected occurrence calculated from the representative sequence(s). A-priori assumptions on the expected occurrences of oligomers can be taken according to theoretical hypotheses in addition to observed sequence data. An example of a simple a-priori assumption is expecting all oligomers to occur equally. This assumption is taken implicitly when calculating Pearson's product moment correlation coefficient between observed occurrences (footnote 1).
When calculating the expected occurrences it is necessary (in the local approach) or possible (in the global approach) to use the observed occurrence of shorter oligomers. A Markov process or some variation of it is typically used for the calculation. A Markov process uses the frequency
of elements or events in a sequence to predict the probability for the next element. The number of prior consecutive elements used to calculate the probability (called the order of the process) can be between 0 and the number of all known elements (the sequence length). When predicting the expected occurrences of oligomers of length n in strings, the order of the process can be between 0 and n - 1. A process with the order of 0 is equivalent to the assumption of equal occurrence for all monomers. A process with the order of 1 uses the occurrence of the monomers to calculate the expected occurrence of longer oligomers, and so on.
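As a worked illustration of the two ends of this range, the expected occurrence of an oligomer can be written as follows. The notation is an assumption introduced here and is not taken from the text: $f(\cdot)$ is an observed count in the sequence, $L$ is the sequence length, and $E_m$ is the expected occurrence under a process of order $m$ in the convention used above. The order n - 1 form is a commonly used estimator, and the counts in the numerical example are hypothetical.

```latex
% Order 1: only the monomer composition is used
E_1(w_1 w_2 \dots w_n) \approx L \prod_{i=1}^{n} \frac{f(w_i)}{L}

% Order n-1: the two overlapping (n-1)-mers and their shared (n-2)-mer core
E_{n-1}(w_1 w_2 \dots w_n) =
  \frac{f(w_1 \dots w_{n-1}) \, f(w_2 \dots w_n)}{f(w_2 \dots w_{n-1})}

% Hypothetical counts: f(AC) = 30, f(CG) = 6, f(C) = 120
E_2(\mathrm{ACG}) = \frac{f(\mathrm{AC}) \, f(\mathrm{CG})}{f(\mathrm{C})}
                  = \frac{30 \times 6}{120} = 1.5
```

With these hypothetical counts, an observed occurrence of 4 for ACG would exceed its expectation by 2.5, even though the dinucleotide CG is itself rare in the example, because the rarity of CG is already built into the expectation.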
Finding deviations in observed occurrences of oligomers of length n by comparing them to their expected occurrence calculated using a Markov process with the order of n - 1 neutralizes the effect of abnormal occurrences of the shorter oligomers composing them (Brendel et al., 1986). This is not guaranteed when using only lower orders. Thus, a study using base composition to calculate the expected frequency of pentanucleotides had to exclude all those pentanucleotides containing the rare dinucleotide CpG and termination codons (Volinia et al., 1988). This procedure for calculating the expected occurrences leads to an independence in the occurrence deviations of oligomers of different lengths and so allows the characterization and comparison of sequences by oligomers of various lengths (Pietrokovski et al., 1990). The use of oligomers of several lengths in analyzing nucleotide sequences was found by us to give better resolution over a wider range of relationships than the use
1 In calculating Pearson's product moment correlation coefficient between lists of values (characters) X and Y, r(X,Y), the mean value of each list is subtracted from the values of that list before the values are multiplied:
$$ r(X,Y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}} $$
of oligomers of one length (see next section). It is also possible to use the occurrences of all possible oligomers, lengths n - 1 to 1, composing an oligomer to calculate its expected occurrence (Burge et al., 1992). These oligomers also include non-contiguous ones such as the dimer X-any residue-Y.
Several workers have studied methods which use the composition of sequences to calculate the expected occurrences of oligomers in order to predict the oligomers' actual observed occurrences in other or longer sequences or even in whole genomes (for example: Almagor, 1983; Phillips et al., 1987; Pevzner et al., 1989). Besides the practical uses of these predictions in restriction site mapping, genetic library screening, sequencing projects and various experiments using the PCR technology, the assumptions at the basis of the methods and the parameters used can help in understanding what controls and selects the usage of different oligomers in living organisms. Obviously, the objective in these studies is to minimize the differences between the expected and observed occurrences. This is in contrast to the works which employ such differences in order to characterize and define the sequences (Claverie and Bougueleret, 1986; Brendel et al., 1986). Thus, the models used to calculate the expected occurrences of oligomers in the two approaches are fundamentally different.
The deviations of the observed occurrences of oligomers from the expected ones can be represented by the difference between the observed and expected occurrences (Brendel et al., 1986; Blaisdell, 1986; Solovyev and Makarova, 1993), by their ratio (Russel and Subak-Sharpe, 1977; Volinia et al., 1988; Burge et al., 1992), or by a negative entropy measure (footnote 2) (Claverie and Bougueleret, 1986). The last measure is taken from the field of information theory and indicates the information (order) associated with the actual appearance of a certain state or oligomer.
However, this measure was found to be virtually indistinguishable from a simple probability measure (Claverie and Bougueleret, 1986).
Linguistically characterizing sequences frequently requires the identification of oligomers which occur significantly differently than by chance. A simple way to do this is by a straightforward calculation of the standard deviate of the difference between the observed and expected occurrences (footnote 3) (Brendel et al., 1986). Two works elaborated this formula to generate more exact estimates of the significance of this difference. Pevzner et al. (1989) examined the case of oligomers with internal autocorrelation (whose sequence can partially overlap itself). Stückle et al. (1990) further developed the simultaneous analysis of several sequences with unequal base distribution. However, in many cases these estimates are not substantially different from the simple ones (Pietrokovski, unpublished results). The significance of the ratios of observed over expected occurrences can be statistically tested (Burge et al., 1992).
The linguistic attributes of sequences can be compared by a number of methods. The simplest ways are by the squared Euclidean distance (footnote 4) (Russel and Subak-Sharpe, 1977; Blaisdell, 1989; van Heel, 1991) and by Pearson's product moment correlation coefficient (footnote 1) (cosine measure of similarity) (Brendel, 1986). In both calculations the linguistic attributes of a sequence (the observed occurrences of the oligomers, or the deviations from them) are treated as coordinates which position the sequence at a point in a multidimensional space where each dimension corresponds to one of the oligomers. The relation between the points can be found from the squared Euclidean distance between them or by the cosine of the angle between the vectors defined by the points. The relative contributions of the specific attributes (oligomers) to the distance or correlation can be found straightforwardly (van Heel, 1991; Pietrokovski and Trifonov, 1992; respectively). We found the correlation coefficient measures to be simpler and more useful than distances. The correlation values are over a defined closed range (1 to -1) which is directly informative - values close to and equal to 1 signify very similar and identical sequence attributes, values around 0 signify no relation (random) between the sequence attributes, and values close to -1 signify opposite (under-representation vs. over-representation) sequence attributes. In the distance measure the attributes from each compared sequence have to be normalized to some common values (such as frequency). In order to compare distances calculated with different numbers of attributes (calculations using oligomers of different lengths) the distances themselves must be normalized (Sneath and Sokal, 1973). Finally, if the sequence attributes are expressed as the differences between the observed and expected occurrences of the oligomers it is computationally very simple to calculate Pearson's correlation coefficient between attributes from different sequences (footnote 5).
1 (continued) If the $x_i$ and $y_i$ are observed occurrences (frequencies) this is the same as calculating the coefficient for the differences between the observed and expected occurrences, with the means $(\bar{x}, \bar{y})$ serving as the expected values.
2 $S_i = -c \cdot p_i \cdot \log(p_i)$, where $S_i$ is the contribution of the specific state (oligomer) $i$ to the total negative-entropy of the system, $c$ is a constant, and $p_i$ is the probability for finding the system in the state $i$ (expected occurrence of $i$).
3 $\mathrm{std}(i) = |f(i) - E(i)| / \max\{\sqrt{E(i)}, 1\}$, where std(i) is the standard deviate of oligomer i, f(i) is the observed occurrence of oligomer i, and E(i) is the expected occurrence of oligomer i.
4 $d_{u,v} = \sum_{i=1}^{n} (u_i - v_i)^2$, where $d_{u,v}$ is the squared Euclidean distance between points u and v, n is the number of dimensions in the space, and $u_i$ ($v_i$) is the coordinate of point u (v) in axis i.
5 If all the oligomers of a certain length in a sequence are characterized by the difference between their observed and expected occurrences (contrast value) their sum of contrast values will be equal to 0, since the sum of all their observed occurrences is equal to the sum of all their expected occurrences. Hence, the average contrast value is also 0 and does not have to be subtracted when calculating Pearson's product moment correlation coefficient (footnote 1). Furthermore, it can also be shown that if the expected occurrences are calculated with a Markov process with an order of n - 1 (where n is the oligomer length) then the sum of the contrast values of all oligomers differing only in their last (or first) element (letter) will also be 0, for the same reason. These properties mean that for every over-abundant oligomer(s) there is/are other oligomer(s) under-abundant to the same degree; this is true for the set of all oligomers of the same length and for sets of oligomers differing only in their first or last letters. Viewing the contrast values as describing a vector in a k-dimensional space (where k is the number of different oligomers), this also means that the vectors are actually confined to sub-regions in the space which can be defined by another space with only k(a - 1)/a dimensions, where a is the alphabet size. When calculating the contrast values these properties can be used to check the precision of the calculation or to reduce the computation time.
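The relation between the two comparison measures, and the simplification noted in footnote 5, can be checked numerically. The sketch below is illustrative only: the function names are hypothetical and the two short vectors are made-up contrast values chosen so that each sums to zero.

```python
from math import sqrt


def squared_euclidean(u, v):
    """Squared Euclidean distance between two attribute vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def cosine(u, v):
    """Cosine of the angle between two attribute vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / sqrt(sum(a * a for a in u) * sum(b * b for b in v))


def pearson(u, v):
    """Pearson's r: the cosine measure after subtracting each mean."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    return cosine([a - mean_u for a in u], [b - mean_v for b in v])


# Hypothetical contrast values; each list sums to zero.
x = [1.5, -0.5, -2.0, 1.0]
y = [0.5, 0.5, -1.5, 0.5]

print(squared_euclidean(x, y))
print(cosine(x, y), pearson(x, y))
```

Because each vector already has a mean of zero, subtracting the mean changes nothing and the two correlation calls agree, so for contrast vocabularies the plain cosine measure can be computed directly.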
The distribution function of Pearson's correlation coefficient, necessary for evaluating its significance, depends on the distribution functions of the data (Conover, 1980). These distribution functions are not generally known but can be empirically estimated (Pietrokovski et al., 1990). A number of measures of rank correlation have
distribution functions independent from those of the data and thus their significance can be easily evaluated. Spearman's rank correlation, Rho, and Kendall's Tau are such measures which were used to compare oligomer compositions of nucleotide and amino acid sequences (Mani, 1992; Karlin et al., 1992; Karlin and Bucher, 1992).
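For illustration, Spearman's rank correlation can be computed by ranking each list of values and then applying Pearson's formula to the ranks. The sketch below is a plain-Python version with hypothetical function names; tied values receive the average of the ranks they span, which is one common convention and is assumed here.

```python
from math import sqrt


def average_ranks(values):
    """1-based ranks; tied values share the mean of the ranks they cover."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson's product moment correlation coefficient."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant input: correlation is undefined, report 0
    return cov / sqrt(var_x * var_y)


def spearman_rho(xs, ys):
    """Spearman's Rho: Pearson's r computed on the ranks of the values."""
    return pearson(average_ranks(xs), average_ranks(ys))
```

Only the ordering of the oligomer values enters the result, which is what makes the significance of such a measure independent of how the values themselves are distributed.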
[Figure 1: histogram panels for dimers, trimers, tetramers, pentamers and the S2-5 similarity; horizontal axis: correlation coefficient. Globin, non-human and rRNA sequences are marked in the tails of the histograms.]
Fig. 1. Correlation and similarity values distributions. The human fetal γ-globin gene cluster (EMBL entry HSGLBN) was compared with all human entries in the EMBL Nucleotide Sequence Data Library, Release 17, longer than 1000 bases (930 sequences totaling 2.6 Mb). Inadvertently a few non-human entries (herpes simplex virus, trypanosome) were included. The distribution histogram of the correlation coefficient scores of the contrast dimer, trimer, tetramer and pentamer values and S2-5 similarity values (mean of dimer to pentamer correlation values) of the comparison are plotted. Globin, non-human and rRNA sequences in the tails of the histograms are indicated.
4. Identifying similar (homologous) and dissimilar (taxonomically and functionally) sequences
Databanks of biological sequences increasingly serve not only as data deposits but also as research resources (Gilbert, 1991). These databases contain a very large amount of sequences which have complex relations with each other, with other data and with other databases. Upon determining a (putatively) new sequence it is necessary to define its relation with the already known sequences (Doolittle, 1986). Finding that a new sequence is related by common origin (homologous) to known sequences in the same or other species often leads to important conclusions, specifically about the function and origin of the sequences and in general about the nature of evolution, genome structure and genome development. Searching for homologous sequences is a quantitatively and qualitatively difficult task (Doolittle, 1990; Pearson, 1990). In analyzing sequences from databases it is usually important to identify functionally or taxonomically mislabeled entries. Recognition of such entries can lead to more complete and economic analyses and clearer and more correct results.
The human fetal γ-globin gene cluster was compared by our nucleotide linguistic method with all EMBL data library human entries longer than 1000 bases. The sequences were compared by di-, tri-, tetra- and pentanucleotides and their S2-5 similarity, the mean of the di- to pentanucleotide correlation scores, was calculated (Fig. 1). Each histogram has its own distinct shape, i.e., mean score, width and tails. Going from the dimer to pentamer histograms, the mean score decreases from 0.9 to 0. The widths (the score range containing the central 98% of the histogram's area) of the dimer and pentamer histograms are relatively narrow (0.39 and 0.27, respectively), while those of the trimer and tetramer histograms are wider (0.77 and 0.48, respectively).
The similarity score histogram has a mean score of 0.35 and a width of 0.41. The dimer and trimer histograms have tails of low scores which are distinct from the main body of the histogram, the tetramer and pentamer histograms have such distinct tails at their high score
ends and the similarity score histogram has distinct tails at both ends.
The decrease in the correlation scores with the rise in word length is due to the different 'typical similarity' between vocabularies of unrelated human sequences in each oligomer length. This is true for the sequences of all species we examined. The dinucleotide vocabularies of unrelated sequences from the same genome are, in general, very similar (Pietrokovski and Trifonov, 1992). The vocabularies of oligonucleotides of length four or more of such sequences have typical correlation scores around zero.
The width of each histogram shows the range of relatedness between the inspected set of sequences and the probing sequence for that histogram's word length. It is interesting to note that the trimer histogram is the widest histogram by far. This is perhaps due to the fact that most of the sequences in the examined set code for various proteins. Codons, trimers coding for amino acids, are probably a very important class of 'words' in these sequences and also the ones which discriminate them from each other. Species- and gene-specific codon usage is, in fact, well documented (Grantham, 1980). These sequences, very heterogeneous in terms of the proteins they encode, are shown to be most diverse in their trimer vocabularies.
The histograms' distinct tails indicate the resolution of the different word lengths, that is, the ability to distinguish the sequences which are very similar or dissimilar to the probing sequence from the rest of the sequences in the surveyed set. Only the similarity histogram has tails at both edges, thus detecting well both the sequences which are very similar and those dissimilar to the probe sequence, in this case globin genes and the rRNA and non-human sequences, respectively. The similarity between the γ-globin genes and the probing sequence is due to homology.
Thus these genes are clearly discerned, not only by their similarity scores, but by the tetramer and pentamer correlation scores as well. The dissimilar sequences are set apart in the dimer and trimer correlation score histograms and in the similarity score histogram. The dissimilar sequences are either taxonomically (non-human sequences) or functionally (non-protein-coding (rRNA) genes) different. By virtue of being a combination of several independent traits of the compared sequences (the different word length correlation scores) the similarity score is generally a more informative measure of relatedness/unrelatedness than any single one of its components. The data in Fig. 1 are typical and representative of several such comparisons made using human, rodent and insect sequences as probes and examined sets.
[Figure 2: panels A-H; horizontal axis: Position (scale bar: 1000 bases); vertical axis ticks from 0.1 to 0.4.]
Fig. 2. Promoter signals in human genes. The contrast vocabularies of 69 human protein coding promoters (defined as the regions 199 bases upstream to 50 bases downstream of the transcription starts, mutually non-homologous (Bucher, 1988) and without any unidentified bases) were linearly summed for each word length. The same was done with these promoters' complementary sequences and the two vocabularies linearly summed, to construct a compound human promoter vocabulary. Human protein coding genes, not included in the compound vocabulary, with known transcription start site, long flanks around it (at least 600 bases upstream and 1200 bases downstream) and with no unidentified bases were scanned with this vocabulary using a window of 250 bases and a step of 50 bases. The S2-5 similarity values of the fragments with the promoter vocabulary were calculated and plotted for the center of each fragment. The arrows point to the transcription start site of each gene. The sequences were taken from the EMBL Nucleotide Sequence Data Library, Release 17, entries: HSIFD1 (A), HSAIATP (B), HSPRPH1 (C), HSAGG (D), HSTPA (E), HSTGFBG1 (F), HSGLUCG2 (G) and HSMETIA (H).
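The scanning procedure of the legend (a window of 250 bases moved in steps of 50 bases, with each value plotted at the window center) can be sketched as below. The scoring function is passed in as a parameter; `gc_fraction` is only a hypothetical stand-in so that the sketch runs, and is not the similarity to the promoter vocabulary used in this work.

```python
def scan_windows(sequence, score_fragment, window=250, step=50):
    """Return (center_position, score) pairs for successive windows along a sequence."""
    points = []
    for start in range(0, len(sequence) - window + 1, step):
        fragment = sequence[start:start + window]
        center = start + window // 2  # the value is reported at the window center
        points.append((center, score_fragment(fragment)))
    return points


def gc_fraction(fragment):
    """Hypothetical placeholder score: fraction of G and C bases in the fragment."""
    return (fragment.count("G") + fragment.count("C")) / len(fragment)
```

With these window and step values, consecutive fragments overlap by 200 bases, so neighbouring points of a plot are not independent of each other.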
5. Locating eukaryotic promoters
Eukaryotic promoters enable and regulate the transcription of genes. The promoters are relatively short sequences usually located upstream of the genes. The binding of certain proteins to short specific sequences in the promoter is crucial for the promoter function (Mitchell and Tjian, 1989). Although heterogeneous in sequence, the eukaryotic promoters generally contain several characteristic oligomers (Bucher and Trifonov, 1986). Nevertheless, the proper identification of eukaryotic promoters by sequence analysis is still a challenge (for example: Staden, 1988; Bucher, 1990). To map eukaryotic promoters a suitable probe was constructed: a contrast word vocabulary capable of identifying these promoters. Human POL II promoter sequences were chosen from the Eukaryotic Promoter Database (Bucher and Trifonov, 1986; Bucher, 1988). All the human promoters meeting the terms specified in the legend for Fig. 2 were taken. The contrast word vocabularies of the promoters and of their complementary sequences were determined and summed up separately. Preliminary analysis showed that both vocabularies are capable of identifying POL II eukaryotic promoters. To amplify the promoter recognition abilities of the vocabularies by letting them supplement each other, the two vocabularies were added to form the compound promoter vocabulary. This last addition, of course, made the compound vocabulary self-complementary: every word has the same contrast value as its complement. In consequence, either of the two strands of each inspected sequence could be scanned for promoter mapping. The results of scanning eight human protein coding genes are shown (Fig. 2). The genes were randomly selected. While some promoters appear to give a clear strong signal (Fig. 2C, D and G), some appear to give an inconclusive signal (Fig. 2E and H) and some are seemingly undetected (Fig. 2A, B and F).
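Why adding the vocabulary of the complementary strands makes the compound vocabulary self-complementary can be seen in a small sketch. It assumes raw word counts in place of the contrast values used in this work (the argument relies only on the summation being linear), and the function names are hypothetical.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Sequence of the complementary strand, read in its own 5' to 3' direction."""
    return seq.translate(COMPLEMENT)[::-1]


def word_counts(seq, k):
    """Counts of all overlapping words of length k."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def compound_vocabulary(seqs, k):
    """Sum of the vocabularies of the sequences and of their complementary strands."""
    direct, complementary = Counter(), Counter()
    for s in seqs:
        direct.update(word_counts(s, k))
        complementary.update(word_counts(reverse_complement(s), k))
    return direct + complementary
```

Each occurrence of a word on one strand is an occurrence of its reverse complement on the other strand, so after the addition a word and its reverse complement necessarily carry the same value.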
In order to resolve this problem, more genes were examined and the resulting plots were summed up (Fig. 3). When no more suitable human genes were to be found, rat protein coding genes were also used. As noted, not all promoters gave conclusive results but no promoter was mapped in an area of low similarity values (not shown). The plots were first summed in three separate groups (Fig. 3A, B and C) and then summed all together (Fig. 3D). A positive signal is seen in the partial summations, where the amplitudes of the three transcription-start peaks correspond to 1.7, 2.3 and 2.3 standard deviates, respectively. The signal for transcription start is well pronounced in the total summation. As expected from a bona fide signal it is maintained in the total summation, while the noise around it is damped. The positive signals in all partial summations show that the signal does not arise from a specific subset of the examined genes but is rather equally distributed. An evident correlation is seen between the examined promoters and the compound contrast promoter vocabulary.
[Figure 3: panels A-D; horizontal axis: Position (relative to transcription start), ticks at 1000 and 2000; vertical axis ticks from 0.20 to 0.30.]
Fig. 3. Promoter signals summations. Promoter signals of individual protein coding genes (as in Fig. 2) were summed up and normalized. (A and B) Summations of 8 and 7 human genes, respectively (A is the summation of the genes in Fig. 2). (C) Summation of 8 rat genes. (D) Summation of the 23 human and rat genes in A, B and C. All sequences were taken from the EMBL Nucleotide Sequence Data Library, Release 17.
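The summation of individual promoter plots, and the expression of a peak in standard deviates, can be sketched as follows. The text does not specify how the plots were normalized or how the deviates were computed; dividing by the number of genes and using the mean and standard deviation of the summed plot itself are assumptions made for this illustration, as are the function names.

```python
from statistics import mean, pstdev


def sum_signals(signals):
    """Average several plots given as {position relative to transcription start: value}.

    Only positions present in every plot are kept.
    """
    shared = set(signals[0]).intersection(*signals[1:])
    return {p: sum(s[p] for s in signals) / len(signals) for p in sorted(shared)}


def peak_in_standard_deviates(summed, position=0):
    """Height of the value at `position` above the plot mean, in standard deviations."""
    values = list(summed.values())
    return (summed[position] - mean(values)) / pstdev(values)
```

A peak shared by the individual plots survives the averaging, whereas fluctuations that occur at different positions in different genes tend to cancel, which is the damping of noise described above.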
6. Detecting retroviral-like sequences among rodent genomic sequences
Retroviruses integrate their genome into the DNA of their host cell during their life cycle. Occasionally retroviral sequences become integrated in a germ line cell and turn into an endogenous part of the genome. Consequently, most vertebrate genomes stably contain some sequences of retroviral origin (Varmus and Brown, 1989).
In order to identify retroviral-like rodent genomic sequences, the sequence of the human endogenous K10 retrovirus was linguistically compared with all long mouse and rat sequences in the EMBL nucleotide sequence databank. To evaluate the results, the databank descriptions of the compared sequences were searched for retroviral-like sequences. The entries' descriptions identified 18 sequences as retroviral-like (Table 1). Examining the distribution of the S2-5 similarity values of all the sequences (Fig. 4), four retroviral-like sequences have the highest similarity values, four are in the top 10% of the examined set, nine are slightly above the mean value and only one sequence is below the mean value. In appraising the result, the heterogeneity of the retroviral-like rodent sequences should be taken into account. These sequences are quite varied; some appear to be linked to retroviruses only by phenotypic traits (Table 1). To critically assess this result we ran the UWGCG (release 5.1) program WORDSEARCH (Devereux et al., 1984) using the same probe and surveyed sequences used in the linguistic comparison. The WORDSEARCH program uses the Wilbur and Lipman algorithm (Wilbur and Lipman, 1983) to search a set of sequences for sequences having short consecutive identical oligomers ('diagonals') with a probe sequence. The results of the WORDSEARCH algorithm were inferior to those obtained by the linguistic technique, detecting only one of the expected sequences (although that sequence had the highest score). Rodent sequences of retroviral origin were discerned from rodent genomic sequences by virtue of their linguistic traits. Some of the sequences were clearly distinguished and some only generally identified as more similar than most sequences (and thus candidates for further, more rigorous inspection), while an alignment-based algorithm identified only one of the retroviral-like sequences.

Table 1
Viral-like rodent sequences in the EMBL data library release 17

Entry name  Length (bases)  Description a
MMCCE       1236            Murine cellular enhancer of retroviral gene expression, complete cds.
MMERE1M     1240            Mouse (strain 129 G-IX+) endogenous murine leukemia virus mRNA, clone E1.
MMERFV41    2084            Mouse endogenous retrovirus in Fv-4 locus, 5' segment.
MMERFV42    1425            Mouse endogenous retrovirus in Fv-4 locus, 3' segment.
MMERMMTR    1754            Mouse endogenous mammary tumor virus (MMTV) RNA, env gene and right LTR.
MMETNA      5540            Mouse early transposon (ETn).
MMETNB      5032            Mouse early transposon (ETn), partial.
MMIAP1      2217            Murine mRNA fragment for gag related peptide.
MMIAP2      2217            Murine (DBA/2) mRNA fragment for gag related peptide.
MMIAPIL3    5095            Mouse intracisternal A-particle IAP-IL3 genome deleted type I element inserted 5' to the interleukin-3 gene.
MMIAPIS2    1319            Mouse myeloma type II IAP element with AIIins (= A-particle type II insertion).
MMIAPIS3    1419            Mouse embryo type II IAP element with AIIins (= A-particle type II insertion).
MMIGHXI     1350            Mouse Ig γ1-chain switch region gene with ETn LTR insertion.
MMMIARN     3600            Mouse ren-2 IAP genome (MIARN) (intra-cisternal A particle).
MMMULV1     1700            Mouse genomic murine leukemia virus (MuLV) related sequence (LTR-gag).
MMMURS      5689            Mouse retrovirus-related DNA sequence (MuRRS).
RNRAL10     1136            Rat DNA for LTR-like repetitive RAL element RAL (pRAL10).
RNRAL6      3317            Rat DNA for LTR-like repetitive RAL element (clone pRAL6).

a Descriptions from the EMBL Nucleotide Sequence Data Library, Release 17.

[Figure 4: histogram; horizontal axis: S2-5 similarity, from -0.5 to 1; vertical axis ticks from 0 to 60.]
Fig. 4. Linguistic technique database searches. Distribution of S2-5 similarity values between the human endogenous K10 retrovirus genome (EMBL entry HSERVK10) and all EMBL mouse and rat entries longer than 1000 bases (460 and 381 sequences, respectively, totaling 1.9 Mb). The distribution of the viral-like sequences (Table 1) is darkened. Sequences were taken from the EMBL Nucleotide Sequence Data Library, Release 17.
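The way the flagged entries were placed within the overall score distribution (highest values, top 10%, above or below the mean) can be sketched as a small tally. The entry names and scores in any use of this sketch are hypothetical, and the function is only an illustration of the bookkeeping, not the procedure actually used in this work.

```python
from statistics import mean


def summarize_flagged(scores, flagged, top_percent=10):
    """Place flagged entries within a score distribution.

    scores: {entry name: similarity to the probe}; flagged: names of the entries
    expected to be related to the probe (here, those annotated as retroviral-like).
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    n_top = max(1, len(ranked) * top_percent // 100)
    top = set(ranked[:n_top])
    average = mean(scores.values())
    return {
        "in_top": sum(1 for e in flagged if e in top),
        "above_mean": sum(1 for e in flagged if e not in top and scores[e] > average),
        "below_mean": sum(1 for e in flagged if e not in top and scores[e] <= average),
    }
```

A tally of this kind makes explicit that an entry above the mean but outside the top fraction is only a candidate for closer inspection, in line with the distinction drawn in the text.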
7. Identifying imported sequences in the mitochondrial yeast genome
In addition to the genes and sequences appearing in all known mitochondrial (mt) genomes, the mt genome of the yeast Saccharomyces cerevisiae contains another type of sequence. These sequences are found in the mt genomes of yeast, filamentous fungi and plants and are involved in splicing, intron transposition and ribosome assembly (Chomyn and Attardi, 1987; Levings and Brown, 1989). In S. cerevisiae these sequences are mainly composed of group I and II introns and intergenic open reading frames (ORFs) and are apparently largely dispensable. Many of the introns contain ORFs, some of which were shown by genetic and biochemical means to be involved in splicing and transposition of the mt introns. These sequences are hypothesized to be mobile genetic elements capable of intra- and inter-genomic movements.
Comparing the S. cerevisiae mt genome with the dinucleotide vocabulary of the yeast nuclear genome identified several regions which, unlike most of the genome, had positive correlation with the nuclear vocabulary (Fig. 5). Almost all these regions were in intronic and intergenic ORFs, and only a few of the intronic and intergenic ORFs did not have this positive correlation. Upon a more refined analysis the source of the positive correlations was localized to relatively short core fragments, most of which turned out to code for conserved amino acid motifs which are probably vital for the function of the ORF products. The contrast values (over/under abundance) of three dinucleotides were found to be the cause of the positive correlation in all of the fragments. Most likely the yeast mt intronic and intergenic ORFs are of foreign origin. Their original foreign sequence characteristics were largely preserved in their conserved regions, while the other regions of the ORFs have acquired more mt characteristics. Indeed, the features distinguishing the mt genome from the conserved regions of the ORFs were found to be selected for in the intergenic spacer sequences of the genome (de Zamaroczy and Bernardi, 1987); these sequences are probably under weak functional selection and thus strongly influenced by the mt genomic selection.
The linguistic analysis of nucleotide sequences discussed above identified segments of the sequences encoding distinct amino acid motifs. Although the nucleotide sequences directly encode the amino acid sequences this accomplishment is not obvious. The genetic code is degenerate: almost any amino acid sequence can be encoded by a number of different nucleotide sequences.
[Figure 5: vertical axis: Dimer correlation coefficient; the remainder of the figure and its legend is illegible in the source.]
The amino acids are also specified by codons (consecutive, non-overlapping nucleotide triplets), while the linguistic analysis used was of all overlapping dinucleotides. The ability to identify amino acid features from analysis of nucleotide sequences enables one to uniformly analyze nucleotide sequences, not necessarily encoding only proteins. (A more comprehensive description and discussion of these results can be found in Pietrokovski and Trifonov, 1992.)
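The two points made above, that the genetic code is degenerate and that codons are non-overlapping triplets whereas the analysis used all overlapping dinucleotides, can be illustrated with a toy example. The two nucleotide sequences are invented for the illustration, and `CODE` holds only the six entries of the standard genetic code that they need.

```python
from collections import Counter

# Six entries of the standard genetic code (enough for the two toy sequences below).
CODE = {"CTG": "L", "TTA": "L", "GGC": "G", "GGT": "G", "GCC": "A", "GCA": "A"}


def codons(seq):
    """Consecutive, non-overlapping triplets read from the first base."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


def overlapping_dinucleotides(seq):
    """All overlapping dinucleotides, one starting at every position but the last."""
    return [seq[i:i + 2] for i in range(len(seq) - 1)]


def translate(seq):
    """Translate a sequence whose codons are all present in CODE."""
    return "".join(CODE[c] for c in codons(seq))


first, second = "CTGGGCGCC", "TTAGGTGCA"
same_peptide = translate(first) == translate(second)  # both encode Leu-Gly-Ala
dimers_first = Counter(overlapping_dinucleotides(first))
dimers_second = Counter(overlapping_dinucleotides(second))
```

The two sequences encode the same tripeptide yet have different dinucleotide compositions, which is why recovering amino acid motifs from a dinucleotide analysis is not a foregone conclusion.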
References
Almagor, H. (1983) A Markov analysis of DNA sequences. J. Theor. Biol. 104, 633-645.
Altschul, S.F., Gish, W., Miller, W., Myers, E.W. and Lipman, D.J. (1990) Basic local alignment search tool. J. Mol. Biol. 215, 403-410.
Argos, P. and Vingron, M. (1990) Sensitivity comparison of protein amino acid sequences. Methods Enzymol. 183, 352-365.
Bennett Jr., W.R. (1976) Language. In: Scientific and Engineering Problem Solving with the Computer. Prentice-Hall, Englewood Cliffs, NJ, pp. 103-198.
Bernardi, G., Ehrlich, S.D. and Thiery, J. (1973) The specificity of deoxyribonucleases and their use in nucleotide sequence studies. Nature New Biol. 246, 36-40.
Blaisdell, B.E. (1986) A measure of the similarity of sets of sequences not requiring sequence alignment. Proc. Natl. Acad. Sci. USA 83, 5155-5159.
Blaisdell, B.E. (1989) Effectiveness of measures requiring and not requiring prior sequence alignment for estimating the dissimilarity of natural sequences. J. Mol. Evol. 29, 526-537.
Bowie, J.U., Lüthy, R. and Eisenberg, D. (1991) A method to identify protein sequences that fold into a known three-dimensional structure. Science 253, 164-170.
Brendel, V., Beckmann, J.S. and Trifonov, E.N. (1986) Linguistics of nucleotide sequences: morphology and comparison of vocabularies. J. Biomol. Struct. Dyn. 4, 11-21.
Bucher, P. (1988) The Eukaryotic Promoter Database of the Weizmann Institute of Science. EMBL Nucleotide Sequence Data Library Release 17, Postfach 10.2209, D-6900 Heidelberg.
Bucher, P. (1990) Weight matrix descriptions of four eukaryotic RNA polymerase II promoter elements derived from 502 unrelated promoter sequences. J. Mol. Biol. 212, 563-578.
Bucher, P. and Trifonov, E.N. (1986) Compilation and analysis of eukaryotic Pol II promoter sequences. Nucleic Acids Res. 14, 10009-10026.
Burge, C., Campbell, A.M. and Karlin, S. (1992) Over- and under-representation of short oligonucleotides in DNA sequences. Proc. Natl. Acad. Sci. USA 89, 1358-1362.
Chomyn, A. and Attardi, G. (1987) Mitochondrial gene products. Curr. Topics Bioenerg. 15, 295-329.
Claverie, J. and Bougueleret, L. (1986) Heuristic informational analysis of sequences. Nucleic Acids Res. 14, 179-196.
Conover, W.J. (1980) Practical Nonparametric Statistics. John Wiley & Sons, New York, NY.
Cornish-Bowden, A. (1979) How reliably do amino acid composition comparisons predict sequence similarities between proteins? J. Theor. Biol. 76, 369-386.
Daniels, D.L., Plunket, G., Burland, V. and Blattner, F.R. (1992) Analysis of the Escherichia coli genome; DNA sequence of the region from 84.5 to 86.5 minutes. Science 257, 771-778.
de Zamaroczy, M. and Bernardi, G. (1986) The primary structure of the mitochondrial genome of Saccharomyces cerevisiae - a review. Gene 47, 155-177.
de Zamaroczy, M. and Bernardi, G. (1987) The AT spacers and the var1 genes from the mitochondrial genomes of Saccharomyces cerevisiae and Torulopsis glabrata: evolutionary origin and mechanism of formation. Gene 54, 1-22.
Devereux, J., Haeberli, P. and Smithies, O. (1984) A comprehensive set of sequence analysis programs for the VAX. Nucleic Acids Res. 12, 387-395.
Doolittle, R.F. (1986) Of Urfs and Orfs: A Primer on How to Analyze Derived Amino Acid Sequences. University Science Books, Mill Valley, CA.
Doolittle, R.F. (1990) Searching through sequence databases. Methods Enzymol. 183, 99-110.
Ferran, E.A., Ferrara, P. and Pflugfelder, B. (1993) Protein classification using neural networks. In: Proceedings of First International Conference on Intelligent Systems for Molecular Biology, pp. 127-135.
George, D.G., Barker, W.C. and Hunt, L.T. (1990) Mutation data matrix and its uses. Methods Enzymol. 183, 333-351.
Gibbs, A.J., Dale, M.B., Kinns, H.R. and MacKenzie, H.G. (1971) The transition matrix method for comparing sequences; its use in describing and classifying proteins by their amino acid sequences. Syst. Zool. 20, 417-425.
Gilbert, W. (1991) Towards a paradigm shift in biology. Nature 349, 99.
Goad, W.B. (1986) Computational analysis of genetic sequences. Annu. Rev. Biophys. Biophys. Chem. 15, 79-95.
Grantham, R. (1980) Workings of the genetic code. Trends Biochem. Sci. 5, 327-331.
Jones, D.T., Taylor, W.R. and Thornton, J.M. (1992) The rapid generation of mutation data matrices from protein sequences. Comput. Appl. Biol. Sci. 8, 275-282.
Karlin, S. and Brendel, V. (1992) Chance and statistical significance in protein and DNA sequence analysis. Science 257, 39-49.
Karlin, S. and Bucher, P. (1992) Correlation analysis of amino acid usage in protein classes. Proc. Natl. Acad. Sci. USA 89, 12165-12169.
Karlin, S., Bucher, P., Brendel, V. and Altschul, S.F. (1991) Statistical methods and insights for protein and DNA sequences. Annu. Rev. Biophys. Biophys. Chem. 20, 175-203.
Karlin, S., Campbell, A.M. and Burge, C. (1992) Statistical analyses of counts and distributions of restriction sites in DNA sequence. Nucleic Acids Res. 20, 1363-1370.
Karlin, S. and Macken, C. (1991) Assessment of inhomogeneities in an E. coli physical map. Nucleic Acids Res. 19, 4241-4246.
Lee, K.Y., Wahl, R. and Barbu, E. (1956) Contenu en bases puriques et pyrimidiques des acides desoxyribonucleiques des bacteries. Ann. Inst. Pasteur 91, 212-224.
Levings, C.S. and Brown, G.G. (1989) Molecular biology of plant mitochondria. Cell 56, 171-179.
Mani, G.S. (1992) Correlations between the coding and noncoding regions in DNA. J. Theor. Biol. 158, 429-445.
Martin-Gallardo, A., McCombie, W.R., Gocayne, J.D., FitzGerald, M.G., Wallace, S., Lee, B.M.B., Lamerdin, J., Trapp, S., Kelley, J.M., Liu, L-I., Dubnick, M., Johnston-Dow, L.A., Kerlavage, A.R., de Jong, P., Carrano, A., Fields, C. and Venter, C. (1992) Automated DNA sequencing and analysis of 106 kilobases from human chromosome 19q13.3. Nature Genet. 1, 34-39.
Mitchell, P.J. and Tjian, R. (1989) Transcriptional regulation in mammalian cells by sequence specific DNA binding proteins. Science 245, 371-378.
Morton, A.Q. (1978) Literary Detection: How to Prove Authorship and Fraud in Literature and Documents. Charles Scribner's Sons, New York, NY.
Needleman, S.B. and Wunsch, C.D. (1970) A general method applicable to the search for similarities in the amino acid sequences of two proteins. J. Mol. Biol. 48, 444-453.
Nishikawa, K., Kubota, Y. and Ooi, T. (1983) Classification of proteins into groups based on amino acid composition and other characters. I. Angular distribution. J. Biochem. 94, 981-995.
Oliver, S.G., van der Aart, Q.J., Agostoni-Carbone, M.L., Aigle, M., Alberghina, L., Alexandraki, D., Antoine, G., Anwar, R., Ballesta, J.P., Benit, P., et al. (1992) The complete DNA sequence of yeast chromosome III. Nature 357, 38-46.
Pearson, W. (1990) Rapid and sensitive sequence comparison with FASTP and FASTA. Methods Enzymol. 183, 63-98.
Pearson, W. and Lipman, D. (1988) Improved tools for biological sequence comparison. Proc. Natl. Acad. Sci. USA 85, 2444-2448.
Pevzner, P.A., Borodovsky, M. Yu. and Mironov, A.A. (1989) Linguistics of nucleotide sequences I: The significance of deviations from mean statistical characteristics and prediction of the frequencies of occurrence of words. J. Biomol. Struct. Dyn. 6, 1013-1026.
Phillips, G.J., Arnold, J. and Ivarie, R. (1987) Mono- through hexanucleotide composition of the Escherichia coli genome: a Markov chain analysis. Nucleic Acids Res. 15, 2611-2626.
Pietrokovski, S., Hirshon, J. and Trifonov, E.N. (1990) Linguistic measure of taxonomic and functional relatedness of nucleotide sequences. J. Biomol. Struct. Dyn. 7, 1251-1268.
Pietrokovski, S. and Trifonov, E.N. (1992) Imported sequences in the mitochondrial yeast genome identified by nucleotide linguistics. Gene 122, 129-137.
Russel, G.J. and Subak-Sharpe, J.H. (1977) Similarity of the general designs of protochordates and invertebrates. Nature 266, 533-536.
Salton, G. (1991) Developments in automatic text retrieval. Science 253, 974-980.
Smith, T.F. and Waterman, M.S. (1981) Identification of common molecular sub-sequences. J. Mol. Biol. 147, 195-197.
Sneath, P.H.A. and Sokal, R.R. (1973) Numerical Taxonomy. W.H. Freeman, San Francisco, CA.
Solovyev, V.V. and Makarova, K.S. (1993) A novel method of protein sequence classification based on oligopeptide frequency analysis and its application to search for functional sites and to domain localization. Comput. Appl. Biol. Sci. 9, 17-24.
Sorm, F., Keil, B., Holyesovky, V., Knesslová, V., Kostka, V., Mäsiar, P., Meloun, B., Mikes, O., Tomášek, V. and Vanecek, J. (1957) Coll. Czech. Chem. Commun. 22, 1310.
Staden, R. (1988) Methods to define and locate patterns of motifs in sequences. Comput. Appl. Biol. Sci. 4, 53-60.
Stückle, E.E., Emmrich, C., Grob, U. and Nielsen, P.J. (1990) Statistical analysis of nucleotide sequences. Nucleic Acids Res. 18, 6641-6647.
Subak-Sharpe, H., Burk, R.R., Crawford, L.V., Morrison, J.M., Hay, J. and Keir, H.M. (1966) An approach to evolutionary relationships of mammalian DNA viruses through analysis of the pattern of nearest-neighbor base sequences. Cold Spring Harbor Symp. Quant. Biol. 31, 737-748.
Sulston, J., Du, Z., Thomas, K., Wilson, R., Hillier, L., Staden, R., Halloran, N., Green, P., Thierry-Mieg, J., Qiu, L., Dear, S., Coulson, A., Craxton, M., Durbin, R., Berks, M., Metzstein, M., Hawkins, T., Ainscough, R. and Waterston, R. (1992) The C. elegans genome sequencing project: a beginning. Nature 356, 37-41.
Uberbacher, E.C. and Mural, R.J. (1991) Locating protein coding regions in human DNA sequences by a multiple sensor-neural network approach. Proc. Natl. Acad. Sci. USA 88, 11261-11265.
Varmus, H. and Brown, P. (1989) Retroviruses. In: Berg, D.E. and Howe, M.M. (Eds.), Mobile DNA, American Society for Microbiology, Washington, DC, pp. 53-108.
van Heel, M. (1991) A new family of powerful multivariate statistical sequence analysis techniques. J. Mol. Biol. 220, 877-887.
Volinia, S., Bernardi, F., Gambari, R. and Barrai, I. (1988) Co-localization of rare oligonucleotides and regulatory elements in mammalian upstream gene regions. J. Mol. Biol. 203, 385-390.
von Heijne, G. (1988) Getting sense out of sequence data. Nature 333, 605-607.
Waterman, M.S. (1984) General methods of sequence comparison. Bull. Math. Biol. 46, 473-500.
White, O., Dunning, T., Sutton, G., Adams, M., Venter, J.C. and Fields, C. (1993) A quality control algorithm for DNA sequencing projects. Nucleic Acids Res. 21, 3829-3838.
Wilbur, W.J. and Lipman, D.J. (1983) Rapid similarity searches of nucleic acid and protein data banks. Proc. Natl. Acad. Sci. USA 80, 726-730.
Williams, J., Clegg, J.B. and Mutch, M.O. (1961) Coincidence and protein structure. J. Mol. Biol. 3, 532-540.
Woese, C., Sogin, M., Stahl, D., Lewis, B.J. and Bonen, L. (1976) A comparison of the 16S ribosomal RNAs from mesophilic and thermophilic bacilli: some modifications in the Sanger method for RNA sequencing. J. Mol. Evol. 7, 197-213.
Wu, C., Berry, M., Fung, Y.S. and McLarty, J. (1993) Neural networks for molecular sequence classification. In: Proceedings of First International Conference on Intelligent Systems for Molecular Biology, pp. 429-437.