Global complexity analysis of genomes


BioSystems, 30 (1993) 201-214


Elsevier Scientific Publishers Ireland, Ltd.

V. D. Gusev (a), V. A. Kulichkov (b) and O. M. Chupakhina (a)

(a) Institute of Mathematics, Russian Acad. Sci., Siberian Branch, Universitetskij prospekt 4, Novosibirsk, 630090, Russia
(b) Research Institute of Molecular Biology, Koltsovo, Novosibirsk region, 633159, Russia

Introduction

The complexity analysis of genomes can be performed on two levels: local and global. The former involves the discovery, classification and interpretation of local fragments of anomalous complexity. The latter involves estimating the complexity of the genome as a whole and delineating links between local (not necessarily anomalous) fragments arbitrarily distanced along the genome. Out of the whole diversity of links we choose those that permit a sufficiently simple interpretation in terms of complexity. They are therefore searched for by algorithmically similar methods based on the notion of the complexity of a finite sequence. Methods and results of the local complexity analysis of various genomes are presented in the accompanying paper (Gusev et al., this volume). Here we consider the global level. The notation of (Gusev et al., this volume) is retained.

Estimates of the genome complexity as a whole can be useful for classification. In Section 1 a set of complexity characteristics is suggested and a comparative analysis of genomes is performed. The basic set can easily be enlarged. The results of this section are related to recent papers on the fractal representation of genomes (e.g., Jeffrey, 1990). In particular, the characteristic C̄ introduced there plays essentially the same role as the fractal (non-integer) dimension. For many dynamic systems the fractal structure and dimension are the main classification characteristics.

The analysis of associative links between local fragments is considered in Section 2. This problem is one of the most pressing in computer genetics. A particular case of it is the homology search. The suggested approach is specific, since it deals with associative links on the structural level. The search for structural homologies is important in the analysis of strongly diverged genomes.
Three approaches to the association search are considered: (i) the search for homologous fragments based on islands of structural similarity; (ii) the search for strong regularities (e.g. long terminal repeats) based on successive analysis of the extremes of the complexity profile; (iii) the delineation of isomorphic fragments (i.e. fragments coinciding up to a renaming of the elementary symbols).

Section 3 is dedicated to non-standard uses of the complexity analysis that can be considered within the general framework of the global approach. Among them are the search for punctuation marks of a fixed type; the fast search for repeats, stem-loops and symmetries with the distance between the elements not exceeding a given threshold; and the compression of genetic texts based on the complexity measures C1 and C2. This section is somewhat more schematic and leaves some room for a potential user. Its objective is to demonstrate some possibilities of the complexity approach that fall outside the usual framework. These approaches were tested in several M.Sc. theses written under the authors' supervision and proved to be sufficiently successful.

0303-2647/93/$06.00 © 1993 Elsevier Scientific Publishers Ireland, Ltd. Printed and Published in Ireland

Table 1
The complexities of several genomes in the 5' → 3' and 3' → 5' directions (measures C1 and C2).

Genome   Length   C1(5'→3')   C1(3'→5')   C2(5'→3')   C2(3'→5')
φX174    5386     995         992         880         872
G4       5577     1018        1019        903         901
SV40     5243     933         929         815         825
BKVMM    4963     878         883         772         776
MS2      3569     732         723         624         628
POLIO1   7440     1313        1321        1163        1166
ASVY73   3718     730         729         635         628

1. Complexity characteristics of a complete genome

The simplest characteristics of this kind are the genome complexities in the 5' → 3' and 3' → 5' directions (Table 1). In general these characteristics are not very informative, since they depend on the sequence length and, moreover, not linearly. This means that normalization by dividing by the sequence length is not correct. More sensible, but requiring additional research when applied to natural genomes, is the normalization $C_1/(N/\log_\sigma N)$, where the denominator determines the complexity of random sequences (here N is the sequence length and σ = |Σ| is the number of elements in the alphabet Σ). We do not consider all possible normalizations here, since another method of comparing genomes is described below. Note only that the ratio C2/C1 is quite reasonable and seemingly requires no further normalization; it implicitly characterizes the extent of symmetrical and complementary motifs in a genome (the smaller C2/C1, the larger the weight of such structures).

Returning to Table 1, note the closeness of the complexities in both directions for each genome. It is largely caused by the small number of symbols in the alphabet, since it is simple to construct a sequence whose complexities in the two directions differ by a factor of at least |Σ|/2. In other words, for large alphabets the difference can be rather substantial.

Consider now another integral characteristic of a genome, namely, the complexity histogram. It is remarkable in that some of its parameters do not depend on N, which allows one to avoid the normalization problem. Let Ci (i = 1, ..., N − D + 1) be the sequence of values forming the complexity profile of a genome (Gusev et al., this volume) for the window of size D, and let Cmin(D), C̄(D) and Cmax(D) be respectively the minimum, the mean and the maximum values of Ci; let also s(D) be the standard deviation.
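The directional complexities and the normalization discussed above can be illustrated with a minimal sketch. The measure C1 is defined in the accompanying paper, not here, so the decomposition below is an assumption: each component is taken to be either a symbol not seen before or the longest fragment that also starts at an earlier position (overlap allowed), which reproduces the decomposition a·b·bb·abb·c·abbc quoted in Section 2.3. The function names are hypothetical, and the 3' → 5' direction is assumed to mean decomposing the reversed sequence.

```python
import math


def c1(seq):
    """Count components of a greedy left-to-right decomposition.

    Assumed reading of the measure C1: a component is a new symbol or the
    longest fragment that also starts at an earlier position (overlap allowed).
    """
    i, n, count = 0, len(seq), 0
    while i < n:
        longest = 0
        for j in range(i):  # candidate source positions strictly before i
            length = 0
            while i + length < n and seq[j + length] == seq[i + length]:
                length += 1
            longest = max(longest, length)
        i += max(longest, 1)  # a symbol with no earlier copy forms its own component
        count += 1
    return count


def directional_c1(seq):
    """Complexities read left to right and right to left."""
    return c1(seq), c1(seq[::-1])


def normalized_c1(seq, alphabet_size):
    """C1 divided by N / log_sigma(N), the normalization named in the text."""
    n = len(seq)
    return c1(seq) / (n / math.log(n, alphabet_size))


if __name__ == "__main__":
    text = "abbbabbcabbc"
    print(directional_c1(text))
    print(round(normalized_c1(text, 3), 3))
```

Even on this 12-symbol example the two reading directions give different component counts, which is the direction dependence the text refers to. The quadratic search is chosen for readability only.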
The complexity histogram is the set of values $H(D) = \{h(C) : C_{\min}(D) \le C \le C_{\max}(D)\}$ ordered by increasing C, where C is an integer and h(C) is the number of elements Ci (i = 1, ..., N − D + 1) numerically equal to C. The histogram parameters described above form an easily interpreted set of integral characteristics of a genome that can be used in classification problems. We can extend the parameter space by fixing various values of D and obtaining a vector of parameters for each of them. In particular, when genomes of varying length are considered, it is possible to use C̄(D), which for D << N is almost independent of the genome length. As an illustration, Table 2 presents the histogram parameters for several genomes. Tables 3 and 4 compare (for window sizes D = 20 and D = 150 respectively) the histogram parameters of the phage λ with those of ten random copies of it obtained by shuffling. The analysis of these tables allows one


to form the following conclusions.

(A) Parameters C̄(D) and s(D) are close for related genomes (cf. φX174 and G4; SV40, BKVMM and BKVDUN; mitochondria HUMMT and BOVMT). The hemagglutinins of the influenza virus (strains P and JAPAN) belong to different subtypes (H1 and H2 respectively) and possess very weak homology, thus the difference between them is more pronounced.

(B) The result of comparing any pair of genomes T1 and T2 by the average complexity in most cases does not depend on the window size D, that is, one of the two conditions C̄T1(D) < C̄T2(D) or C̄T1(D) > C̄T2(D) holds for almost all D (the principle of monotonicity). An analogous statement (but with a larger number of exceptions) can be made for the values of s. Breaks of monotonicity for some D are diagnostic of the existence in the genome of characteristic structures of this size.

(C) The comparison of fragments to which both the C1 analysis (rows 17-20 in Table 2) and the C2 analysis (rows 21-24 in the same table) were applied demonstrates that for a fixed fragment not only C̄2(D) < C̄1(D) (which is natural), but also s2(D) < s1(D). This means that increasing the number of allowed operations in the definition of the measure C2 decreases the scatter of the complexity values in the profile. Comparison of the values of s for a genome and its random copies leads to another observation useful for classification, namely, s_genome(D) > s_random(D) (see Tables 4 and 5).

(D) Anomalously low values of Cmin for a fixed D indicate the existence of a periodicity with a step comparable to D, or of a structure with a shorter step but with a large number of repetitions. There is a negative correlation between Cmin and s: the lower Cmin, the greater, as a rule, is s.

(E) Parameter Cmax is the least informative of all the parameters considered. Genomes do not differ much in Cmax from each other or from random texts (especially for small D). A question arises about its theoretical limit for a fixed D.
This problem is closely related to the existence and the complexity estimate of de Bruijn sequences (de Bruijn, 1946). It can be demonstrated that C1 reaches its maximum on sequences from this class. In particular, for the window sizes D = 20, 50, 100, 150 considered in Tables 2, 3 and 4, the extreme values of Cmax are 18, 33, 52 and 69 respectively. Analysis of these tables shows that for small D (e.g., D = 20) the extreme values are reached both in some genomes and in random texts. For large D this does not happen, i.e. supercomplex fragments occur neither in genomes nor in their random analogs. This indicates that the notions of randomness and complexity are close, but not identical.

(F) The difference between genomes and their random analogs as regards C̄ and Cmin increases with D (Tables 3 and 4). This rule holds, in a less pronounced form, also for the genomes themselves (excluding related ones). Thus large windows should be used for the taxonomy of genomes by their complexity characteristics.

(G) Among the genomes listed in Table 2 the complex ones are MS2, T7, G4 and POLIO1, while the relatively simple ones are SV40 and BKV, the Epstein-Barr (EBV) and vaccinia viruses, as well as the fragments of human sequences (rows 19 and 20 of Table 2). Among the factors that decrease complexity one should mention substantial deviation from uniformity of the frequencies and positional frequencies of nucleotides, correlation between neighboring nucleotides, and the existence of periodicities with a step comparable to D.

The above observations demonstrate that complexity histograms retain much useful information about the genome as a whole and can be used for classification purposes, since related genomes are close in the parameter space for varying window size. In the accompanying paper (Gusev et al., this volume) we have already mentioned local manifestations of fractality in genetic texts.
Representation of genomes as point sets (Jeffrey, 1990) allows one to pose the question of the fractal dimension of these sets. Estimates of it can be useful in considering the global manifestations of fractality in a genome. Parameter C̄(D) might be quite useful in this regard, since for stationary sequences Ziv and Lempel (1976) have demonstrated the convergence of $\bar{C}(D)\log_2(D)/D$ to the Shannon entropy as D increases. The entropy, in turn, is closely linked to the Hausdorff dimension (Grassberger, 1989), which plays a key role in the characterization of objects of fractal nature.
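The histogram parameters used throughout this section (Cmin, C̄, Cmax, s, and the mode C0 with its count h(C0)) can be computed from a sliding-window profile as in the following sketch. It is an illustration under stated assumptions, not the authors' program: `c1` is the same assumed greedy decomposition as before, the population standard deviation is used for s, ties for C0 are resolved towards the smaller C, and all function names are hypothetical.

```python
import random
from collections import Counter
from statistics import mean, pstdev


def c1(seq):
    """Assumed C1: number of components of a greedy copy decomposition."""
    i, n, count = 0, len(seq), 0
    while i < n:
        longest = 0
        for j in range(i):
            length = 0
            while i + length < n and seq[j + length] == seq[i + length]:
                length += 1
            longest = max(longest, length)
        i += max(longest, 1)
        count += 1
    return count


def complexity_profile(seq, window):
    """Values C_i for i = 1 .. N - D + 1 (stored 0-based)."""
    return [c1(seq[i:i + window]) for i in range(len(seq) - window + 1)]


def histogram_parameters(profile):
    """Integral characteristics of the complexity histogram h(C)."""
    hist = Counter(profile)
    c0 = max(sorted(hist), key=lambda c: hist[c])  # smallest C at the maximum
    return {
        "Cmin": min(profile),
        "Cmean": mean(profile),
        "Cmax": max(profile),
        "s": pstdev(profile),  # assumption: population standard deviation
        "C0": c0,
        "h(C0)": hist[c0],
    }


def shuffled_copy(seq, rng):
    """Random copy with the same base content, obtained by shuffling."""
    symbols = list(seq)
    rng.shuffle(symbols)
    return "".join(symbols)


if __name__ == "__main__":
    rng = random.Random(0)
    text = "".join(rng.choice("ACGT") for _ in range(200)) + "ACG" * 30
    print(histogram_parameters(complexity_profile(text, 20)))
    print(histogram_parameters(complexity_profile(shuffled_copy(text, rng), 20)))
```

Comparing the two printed dictionaries mirrors the comparison of a genome with its shuffled copies in Tables 3 and 4; the toy text here is synthetic and carries no biological meaning.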


Table 2
Parameters of the complexity histograms for the sliding window sizes D = 20, D = 50 and D = 100. Measures C1 (rows 1-20) and C2 (rows 21-24) are used.

                                        D = 20                   D = 50                   D = 100
No.  Genome                 Length   Cmin  C̄      Cmax  s     Cmin  C̄      Cmax  s     Cmin  C̄      Cmax  s
1    φX174 (complete)       5386     7     13.77  18    1.36  21    26.09  33    1.57  36    42.42  49    1.85
2    G4 (complete)          5577     8     13.84  17    1.36  20    26.21  31    1.60  35    42.51  48    1.89
3    SV40 (complete)        5243     7     13.07  17    1.54  16    24.68  31    1.97  29    40.00  47    2.51
4    BKVMM (complete)       4963     6     12.98  17    1.59  16    24.55  30    2.07  29    39.86  47    2.44
5    BKVDUN (complete)      5153     6     13.01  17    1.58  16    24.61  31    2.13  22    39.87  47    2.73
6    MS2 (complete)         3569     9     13.93  18    1.28  21    26.48  32    1.52  35    43.15  50    1.93
7    MC29 (complete)        3779     7     13.18  18    1.77  14    25.13  31    2.53  26    41.06  50    3.36
8    POLIO1 (complete)      7440     9     13.68  18    1.33  19    26.02  31    1.67  35    42.53  50    1.88
9    HUMMT (complete)       16569    7     13.16  18    1.50  16    25.10  31    2.01  32    41.10  49    2.39
10   BOVMT (complete)       16338    7     13.30  18    1.48  18    25.21  31    1.94  33    41.17  48    2.32
11   ASVY73 (complete)      3718     7     13.46  18    1.61  13    25.70  31    2.33  27    41.86  48    3.04
12   FBJ (complete)         4226     4     13.26  18    1.62  17    25.24  31    2.09  23    40.46  48    3.79
13   T7 (complete)          39936    7     14.03  18    1.37  13    26.31  31    1.79  27    42.53  49    2.20
14   λ (complete)           48502    7     13.65  18    1.42  16    25.85  32    1.77  31    42.08  49    2.16
15   EBV (complete)         172282   5     13.11  18    1.61  6     24.85  32    2.47  8     40.43  49    3.84
16   Vaccinia (complete)    191737   4     13.20  18    1.95  6     25.16  32    1.98  21    40.74  49    3.08
17   FLP834HAO (fragment)   1778     8     13.41  17    1.51  18    25.37  30    2.00  33    41.36  47    2.42
18   FLJ357BA (fragment)    1773     8     13.41  17    1.44  21    25.56  31    1.81  37    41.76  48    2.07
19   ...B (fragment)        2165     7     12.98  17    1.52  18    24.76  30    2.06  33    40.31  47    2.26
20   ...E (fragment)        4805     2     12.99  17    1.80  9     24.79  30    2.17  30    40.61  47    2.31
21   see no. 17, measure C2          7     11.14  14    1.15  15    20.49  25    1.51  27    33.61  38    1.69
22   see no. 18, measure C2          8     11.15  14    1.11  16    20.73  25    1.51  28    33.84  39    1.78
23   see no. 19, measure C2          7     10.93  14    1.17  16    20.18  25    1.56  27    32.91  39    2.00
24   see no. 20, measure C2          2     10.90  14    1.30  9     20.23  25    1.70  26    33.13  38    1.80

Species:
1 Bacteriophage φX174; 2 Bacteriophage G4; 3 Simian virus 40; 4 Human papova virus BK (MM strain); 5 Human papova virus BK (Dunlop strain); 6 Bacteriophage MS2; 7 Avian myelocytomatosis virus MC29; 8 Poliovirus type 1; 9 Human mitochondrion; 10 Bovine mitochondrion; 11 Avian sarcoma virus Y73; 12 Murine osteosarcoma virus FBJ; 13 Bacteriophage T7; 14 Bacteriophage lambda; 15 Epstein-Barr virus; 16 Vaccinia virus; 17 Influenza virus, Puerto Rico, hemagglutinin (H1N1); 18 Influenza virus, Japan, hemagglutinin (H2N2); 19 Homo sapiens (... globin gene and flanks); 20 Homo sapiens (embryonic epsilon-globin gene & 2 Alu family (...) sequences).


Table 3
Parameters of the complexity histograms of the phage λ and 10 random texts with the similar base content. The window size D = 20. C0 is the value of C for which the maximum of the histogram is obtained.

Text               Cmin   C̄       Cmax   s      C0    h(C0)
Phage λ (genome)   7      13.65   18     1.42   14    13669
λ1 rand.           8      13.94   18     1.33   14    14586
λ2 rand.           8      13.95   18     1.35   14    14402
λ3 rand.           7      13.89   18     1.35   14    14524
λ4 rand.           8      13.88   18     1.32   14    14596
λ5 rand.           8      13.89   18     1.35   14    14330
λ6 rand.           7      13.91   18     1.35   14    14276
λ7 rand.           8(?)   13.93   18     1.34   14    14280
λ8 rand.           8      13.92   18     1.34   14    14516
λ9 rand.           8      13.91   18     1.35   14    14349
λ10 rand.          8      13.88   18     1.32   14    14598

Table 4
Parameters of the complexity histograms for the window size D = 150. Other details as in Table 3.

Text               Cmin   C̄       Cmax   s      C0    h(C0)
Phage λ (genome)   44     56.3    65     2.48   57    8127
λ1 rand.           49     58.04   66     2.09   58    9179
λ2 rand.           49     58.10   66     2.01   58    9658
λ3 rand.           51     58.09   66     2.06   58    9362
λ4 rand.           51     58.01   65     1.97   58    9629
λ5 rand.           49     58.24   66     2.02   58    9580
λ6 rand.           50     58.03   66     2.03   58    9477
λ7 rand.           51     58.12   65     1.97   58    9626
λ8 rand.           50     58.06   65     2.00   58    9695
λ9 rand.           50     58.02   65     2.05   58    9240
λ10 rand.          50     58.05   65     2.01   58    9274

2. Delineation of links between local fragments

2.1. Search for homologous fragments by the structural landmarks

The local homology search is one of the most labor-consuming procedures in the analysis of genetic texts. It is usually solved by dynamic programming algorithms whose time and, more crucially, memory requirements depend quadratically on the text length. These algorithms cannot be used for the analysis of long genomes because of memory restrictions. This obstacle is partially removed in approximate algorithms of local homology search. Usually it is assumed that each pair of homologous fragments contains a sufficiently large conserved core (e.g. a perfect repeat). The homology search is then reduced to the search for all cores (repeats) of a given length and the choice of pairs that can be extended at a sufficiently high level of homology.

A natural generalization of the above approach is to choose candidates not by the existence of a common core, but by some common structural property. A classification of the properties that can be discovered by the complexity analysis is presented in the accompanying paper (Gusev et al., this volume). Note that a structural homology is not necessarily accompanied by a homology in the usual sense. But if they coincide, the existence of common structural features in the discovered homologous fragments is an additional argument for their functional importance.

In order to find structural homologies, one should describe fragments of anomalous complexity in the language of structural regularities. The description includes the features that play the main part in making the fragment anomalous (longest repeats, stem-loops, symmetries, short periodicities with step 3 and larger, integral features of the type '(C,A)-rich', 'T-poor', 'series of (non-A)-elements' etc.). Fragments with close structural descriptions are the best candidates for the usual homology identification.
Since the number of such fragments is not large, it is possible to apply more labor-consuming procedures to them, in particular dynamic programming. Some examples of this approach were presented in the accompanying paper (Gusev et al., this volume). Thus, the link between the extensive homology zones 2 and 2' was established by the structural linking of the common core (CAG)3, coding in both cases for the glutamine clusters (Gln)3. Analogously, zones 3 and 3' are linked by a pair of homologous fragments with the alanine cores (Ala)2 and (Ala)3 (the homology on the amino acid level is approximately 60%). Consider some more examples illustrating the diversity of structural homology variants.

Example 1. Two purine runs with distanced repeats AAGGA are marked as anomalous in the hemagglutinin segments JAPAN (type A) and Lee (type B). Comparison of them demonstrates that the structural homology ((A,G)-richness and common repeats) is accompanied by homology in the ordinary sense on both the amino acid and the nucleotide levels:

JAPAN, pos. 1093:  AGAAGGAGGATGGCAAGGAATGG
Lee,   pos. 1145:  GGAAGGAGGATGGGAAGGAATGA
                    GluGlyGlyTrp   GlyMet

The homology is retained when both fragments are extended by approximately 30 nucleotides in the 5' direction (up to the beginning of the HA2 subunit) and by approximately 20 nucleotides in the 3' direction. Thus this structural feature (a purine run) allows us to discover long homologous fragments in two rather dissimilar hemagglutinin segments (belonging to different types).
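The two ingredients of Example 1, a shared core and a common structural descriptor, can be checked directly on the two quoted fragments. The sketch below is only an illustration: the helper names are hypothetical, the core search is a plain k-mer index (one common way to find perfect repeats, not necessarily the authors' procedure), and "(A,G)-richness" is measured simply as the number of purines.

```python
def shared_cores(a, b, k):
    """All pairs (i, j) with a[i:i+k] == b[j:j+k] (0-based positions)."""
    index = {}
    for j in range(len(b) - k + 1):
        index.setdefault(b[j:j + k], []).append(j)
    return [(i, j) for i in range(len(a) - k + 1) for j in index.get(a[i:i + k], [])]


def matches(a, b):
    """Number of identical positions in two fragments aligned without gaps."""
    return sum(x == y for x, y in zip(a, b))


def purine_count(fragment):
    """Number of A and G symbols, a crude '(A,G)-rich' descriptor."""
    return sum(symbol in "AG" for symbol in fragment)


if __name__ == "__main__":
    japan = "AGAAGGAGGATGGCAAGGAATGG"  # JAPAN, pos. 1093 (from Example 1)
    lee = "GGAAGGAGGATGGGAAGGAATGA"    # Lee, pos. 1145 (from Example 1)
    print(matches(japan, lee), "of", len(japan), "positions coincide")
    print(purine_count(japan), purine_count(lee), "purines")
    print(shared_cores(japan, lee, 5)[:4])
```

In practice the cheap descriptors would be used to shortlist candidate pairs, and only the shortlisted pairs would be passed to a dynamic programming alignment.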

Example 2. An interesting structural homology based on a multiple repeat of the elementary palindrome TGCA is provided by two hemagglutinin fragments of the influenza virus (strain P, subtype H1, pos. 67 and 1710). As noted in (Gusev et al., this volume), such repeats lead to the formation of alternative stem-loop structures (the arrows marking them in the printed alignment are not reproduced):

HA, PR (subtype H1), pos. 67, start of HA1:    TTGCAGCTGCAGA---TGGAGA
HA, PR (subtype H1), pos. 1710, end of HA2:    TTGCAG-TGCAGAATATGCATCTGA
HA, JAPAN (subtype H2), pos. 1709:             CTGCAG-TGCAGGATCTGCATATGA
                                                      Cys      Cys

It is interesting to note that both structurally similar fragments of HA (P) are situated at approximately the same distance from the termini of the HA segment (its length is 1779 nucleotides), one at the beginning of the HA1 subunit and the other at the end of the HA2 subunit. This confirms the tendency of anomalous fragments to be situated on the boundaries of structural regions. Moreover, the fragment at the end of HA2 codes for two cysteines and is evolutionarily stable, since it is marked as anomalous in many outgroup (strongly diverged) hemagglutinin strains (in particular, HA (JAPAN) in the third row of the above alignment).

Example 3. In polymerase segments 1 and 2 of the influenza virus (strain P) we discovered, inside anomalous zones, structurally similar fragments, namely palindromes flanked by TGAT repeats:

P1, pos. 1152:  TGATTTGAAAT--ATTTCAATGAT
P2, pos. 1458:  TGATTTCCAATTAATTCCAATGAT

Such flanking is often a signal of recombination events. However, the two polymerases display no homology.

2.2. Determination of the repeat structure in long genomes

Some long genomes (e.g. the Epstein-Barr and vaccinia viruses) possess a complicated structure formed by non-random tandem and distanced repeats of various lengths. Their determination by the prefix tree technique is difficult, since construction of the corresponding trees and graphs requires large memory. The use of complexity profiles often allows one to obtain the necessary information about repeats even if memory is restricted. Two considerations form the basis of such analysis.

(1) If the window size D exceeds the period length, then the complexity in the corresponding region is anomalously low (even when compared to the threshold C̄ − 3s). This effect is stable with respect to distortions (small insertions between periods, extensions and breaks of the basic unit), while ordinary methods of periodicity analysis usually fail in such situations.

(2) If the repeated fragments are much longer than D, then, irrespective of the tandem or distanced character of these repeats (cf., in particular, long terminal repeats in some genomes), the probability of anomalous zones within these fragments is large. Usually the movement of the window of size D inside an anomalous zone is repeatedly accompanied by situations when Ci < C̄ − 3s, i.e. anomalous values are clustered in the profile. Discovery of similar clusters signals a possible coincidence between the regions containing these clusters.

This observation suggests rarefying the complexity profile by retaining only the extreme values and searching for repeats in the sequence so processed. The number of retained extreme values is regulated by setting the corresponding threshold. By varying the threshold it is always possible to guarantee the discovery of corresponding chains of extreme values in each of the long repeated fragments.

We illustrate the clustering effect with the Epstein-Barr virus genome. Consider the complexity profile obtained with the window size D = 75. According to the rule of three standard deviations, the anomalous values in this case are Ci ≤ 23. Let i1, i2, ..., iK (iK ≤ N − D + 1) be the genome positions that correspond to anomalous complexity values. We characterize each of them by a pair of numbers (Ck, Δk), k = 1, ..., K, where Ck is the complexity value at position ik and Δk = ik − i(k−1) is the distance between the current position and the previous anomalous position (we assume Δ1 = i1). One cluster of anomalous values begins at the genome position ik = 50616.
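The rarefied profile, that is, the sequence of pairs (Ck, Δk) over the anomalous positions, can be extracted as in the sketch below. It assumes that the complexity profile is already available as a list of integers with positions counted from 1; the function name and the optional explicit threshold are hypothetical, and the default threshold follows the rule of three standard deviations, with the population standard deviation taken as an assumption.

```python
from statistics import mean, pstdev


def anomalous_pairs(profile, threshold=None):
    """Return the pairs (C_k, delta_k) for positions with C_i below the threshold.

    Positions are 1-based; delta_1 equals the first anomalous position itself,
    as in the convention delta_1 = i_1 of the text.
    """
    if threshold is None:
        threshold = mean(profile) - 3 * pstdev(profile)  # rule of three s
    pairs, previous = [], 0
    for position, value in enumerate(profile, start=1):
        if value < threshold:
            pairs.append((value, position - previous))
            previous = position
    return pairs


if __name__ == "__main__":
    toy_profile = [30, 30, 22, 23, 30, 30, 30, 21]
    print(anomalous_pairs(toy_profile, threshold=24))
    # One repetition of a unit advances the position by the sum of its deltas.
    unit_deltas = [1, 1, 6, 1, 2, 1, 1, 1, 1, 1, 1, 1, 37, 70]
    print(sum(unit_deltas))
```

Searching for repeats in the short list of pairs, rather than in the genome itself, is what keeps the memory demand low. Summing the Δ values over one basic unit gives the step between consecutive copies of the unit.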
The basic unit of the cluster beginning at position 50616 is

Ck:  23 23 23 22 23 23 23 23 22 21 22 23 23 23
Δk:   1  1  6  1  2  1  1  1  1  1  1  1 37 70

This unit repeats 12 times in a row. The distance between the first and the last positions in the unit is 125. Thus we have a 12-fold periodicity of this length (plus one incomplete unit). Other repeats, with lengths from hundreds to thousands of nucleotides, can be delineated in the EBV genome analogously.

An advantage of this method over the traditional ones is that the repeat search in the initial (very long) sequence is reduced to a search in the shorter sequence of anomalous values (Ck, Δk), k = 1, ..., K. The number of anomalous values K is lower than the genome length by several orders of magnitude. Thus we avoid the problems with storage space, with the complexity of constructing prefix trees and word graphs, with their sensitivity to distortions in the periodical structure, etc.

2.3. Search for isomorphic segments

In this section we consider the search for specific links between arbitrarily distanced fragments of a text that in general are not of anomalous complexity. The specific feature of these links is that one fragment can be obtained from the other by a renaming of the alphabet symbols that is not fixed a priori and, possibly, by a change in the reading direction. Such links can be described as the f-repeats defined in (Gusev et al., this volume). Here we use the term isomorphic fragments as preferable from the semantic point of view.

Two fragments are called isomorphic (or possessing similar complexity structure) if (i) their lengths coincide; (ii) the length vectors of the complexity components (and thus the complexity values) are the same for both fragments; (iii) the vectors of copying pointers coincide. For instance, the two sequences S1 = a·b·bb·abb·c·abbc and S2 = c·a·aa·caa·b·caab possess similar complexity structures, while the complexity structure of S3 = a·b·bb·abb·a·abba is different, since the copying pointer for the fifth component equals 0 for S1 and S2, while it is 1 for S3.

Since the processing is performed in two directions (left to right and right to left), it is possible to find pairs of fragments of similar complexity structure but different direction. In particular, this scheme allows one to find mirror and complementary palindromes, distanced symmetries and stem-loop constructions.

An advantage of this method of search for isomorphic fragments is its universality, since the renaming (that is, a one-to-one mapping f: Σ → Σ) is not fixed beforehand (for large alphabets the number of possible renamings is large and a direct search over all renamings is not effective). Nevertheless, for all possible f the effective determination of all pairs of fragments that form repeats in the above sense is guaranteed.

Let S be the text being analyzed, and let D be the lower threshold for the lengths of the isomorphic fragments of interest. The idea of the method is to use the results of computing the complexity profile with the window size D in order to divide the complete set of fragments of S into non-intersecting subsets, so that the elements of any such subset can form f-repeats only with each other, but not with elements of other subsets. This allows one to avoid unnecessary comparisons and substantially decreases the number of operations. The first step in the subdivision of S is performed with regard to the length vectors of the complexity components. The resulting subsets are further subdivided with regard to the vectors of copying pointers. Each pair of isomorphic fragments obtained is extended if possible. Thus all isomorphic fragments of length D or more are determined. The time complexity of the algorithm is $O(N(\log_2 D/|\Sigma| + D))$, and the required memory is $O(ND)$. A more detailed description of the algorithm is presented in (Gusev and Chupakhina, 1991). In the latter work we also suggested a scheme for classifying isomorphic fragments by the reading direction of the elements in the fragments forming an f-repeat, the localization of these fragments, and the type of the mapping

f. Experiments in which isomorphic fragments were studied were performed with several genomes and their random analogs, in particular with the genome of the phage λ (Gusev and Chupakhina, 1991). They had several objectives. First, we were interested in whether the extension of the class of repeats leads to the discovery of non-random repeats of novel types, differing from the ones usually encountered in genetic texts. Second, we studied which types of the mappings f prevail in genetic texts. Third, the possible existence of an inner structure in isomorphic fragments was of some interest.

The answer to the first question was negative. We did not discover in natural genomes isomorphic structures of lengths significantly different from the lengths obtained in random simulations of genome analogs. This result is not surprising: were the contrary true, it would be reasonable to suppose the existence of some unknown interactions (similar to the one causing complementary base-pairing), which is hardly probable.

The answer to the second question we illustrate by the numerical data obtained in the analysis of the phage λ genome. Out of 77 isomorphic fragments of length 15 or more nucleotides, 20 were direct and 57 inverted; 24 were tandem (of the palindrome type) and 53 were distanced. Out of the 24 possible permutations (mappings f: Σ → Σ) almost all were encountered. The most frequent permutations were the following (each written as the images of A, G, C, T):

f1 = (AGCT → ACGT)    10 times       (1)
f2 = (AGCT → GACT)     7 times       (2)
f3 = (AGCT → ?)        5 times       (3)
f4 = (AGCT → AGTC)     5 times       (4)
f5 = (AGCT → AGCT)     5 times       (5)
f6 = (AGCT → TGCA)     5 times       (6)
f7 = (AGCT → TCGA)     4 times etc.  (7)
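The definition of isomorphic fragments given in this section can be tried on the sequences S1, S2 and S3 quoted there. The sketch below is an illustration under assumptions: a component is taken to be a new symbol (pointer 0) or the longest copy of a fragment starting at an earlier position, and the pointer is the 1-based start of the leftmost such source, which reproduces the pointer values 0 and 1 mentioned in the text. The function names are hypothetical, and the grouping of windows is a simplified stand-in for the subdivision step, without the extension of pairs or the second reading direction.

```python
from collections import defaultdict


def complexity_structure(seq):
    """Return (component lengths, copying pointers) of a greedy decomposition."""
    lengths, pointers = [], []
    i, n = 0, len(seq)
    while i < n:
        best_length, best_source = 0, -1
        for j in range(i):  # leftmost longest earlier occurrence
            length = 0
            while i + length < n and seq[j + length] == seq[i + length]:
                length += 1
            if length > best_length:
                best_length, best_source = length, j
        lengths.append(max(best_length, 1))
        pointers.append(best_source + 1)  # 0 marks a new symbol
        i += max(best_length, 1)
    return tuple(lengths), tuple(pointers)


def isomorphic(x, y):
    """Same length, same component lengths and same copying pointers."""
    return len(x) == len(y) and complexity_structure(x) == complexity_structure(y)


def candidate_groups(text, window):
    """Group window start positions by complexity structure (size > 1 only)."""
    groups = defaultdict(list)
    for start in range(len(text) - window + 1):
        groups[complexity_structure(text[start:start + window])].append(start)
    return [starts for starts in groups.values() if len(starts) > 1]


if __name__ == "__main__":
    s1, s2, s3 = "abbbabbcabbc", "caaacaabcaab", "abbbabbaabba"
    print(complexity_structure(s1))
    print(isomorphic(s1, s2), isomorphic(s1, s3))
```

Because only windows that fall into the same group need to be compared, the renaming f never has to be enumerated explicitly, which is the advantage the section stresses.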

Note two peculiarities of the distribution of the permutation frequencies.

(1) All frequent permutations can be represented as superpositions of cycles of length 1 or 2, i.e. either f(a) = a (an identity mapping) or f(a) = b, f(b) = a (pairwise substitution), a, b ∈ Σ. The existence of 1-cycles in frequent permutations implies that many isomorphic fragments are imperfect repeats and palindromes:

(i) f = (AGCT → TGCA):

                      GlnAlaLeuLeuAla
gene V, pos. 9233:   GCAGGCGCTGCTGGCG
gene H, pos. 12212:  GCTGGCGCAGCAGGCG
                      LeuAlaGlnGlnAla

(ii) f = (AGCT → AGTC):

orf-204, pos. 42615:  AAACAGAAAGATAAA
                      LysGlnLysAspLys

(2) Almost all frequent permutations allow a natural explanation. Thus the analysis of the 10 f1-repeats demonstrated that most of them are stem-loop structures, with the stem consisting predominantly of G and C, and the loop mainly formed by A and T (f1 = (AGCT → ACGT)):

gene Nu3, pos. 5480:  T[GCGGCAGAAAACAGCCGC]A
                        AlaAlaGluAsnSerArg

The mappings f2 and f4 describe purine and pyrimidine substitutions. These are the most widespread types of evolutionary substitutions. The identity mapping f5 characterizes ordinary repeats and mirror palindromes, while f7 describes complementary base-pairing. Substitutions with cycle length exceeding 2 are rare, and their interpretation is difficult, as exemplified by the following distanced inverted f-repeat with the cycle length 4 (f = (AGCT → GTAC), non-coding regions):

pos. 24110:  AGAGTTGTGGCTTGGCT
pos. 29255:  CATTCCATTCTCCTGTG

Here one should note the structural similarity as regards the existence of short periodicities and the mode of their alternation. Turning now to the third problem, we emphasize that the regularities of the inner structure of isomorphic fragments can be considered on two levels: nucleotide and amino acid. The nucleotide level is usually taken into account when an f-repeat is situated in a non-coding region of a genome (see the above example). Here the structural elements are short periodicities and palindromic and stem-loop constructions.
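The mapping f behind a given pair of fragments, and its cycle structure, can be recovered mechanically. The sketch below uses hypothetical helper names and takes the sequences from the examples of this section. In the gene H fragment the symbol printed as "V" is assumed to be a misprint for C, which the amino acid line (Gln = CAG) supports. For the inverted 4-cycle example the second fragment is read in reverse.

```python
def infer_mapping(x, y):
    """Return the one-to-one symbol mapping f with f(x[i]) == y[i], or None."""
    if len(x) != len(y):
        return None
    forward, backward = {}, {}
    for a, b in zip(x, y):
        # setdefault records the first image seen and exposes any later conflict
        if forward.setdefault(a, b) != b or backward.setdefault(b, a) != a:
            return None
    return forward


def cycle_lengths(mapping):
    """Sorted cycle lengths of a mapping that permutes its own key set."""
    if set(mapping) != set(mapping.values()):
        raise ValueError("mapping is not a permutation of the observed symbols")
    seen, lengths = set(), []
    for start in mapping:
        size, current = 0, start
        while current not in seen:
            seen.add(current)
            current = mapping[current]
            size += 1
        if size:
            lengths.append(size)
    return sorted(lengths)


if __name__ == "__main__":
    direct = infer_mapping("GCAGGCGCTGCTGGCG", "GCTGGCGCAGCAGGCG")
    print(direct, cycle_lengths(direct))
    inverted = infer_mapping("AGAGTTGTGGCTTGGCT", "CATTCCATTCTCCTGTG"[::-1])
    print(inverted, cycle_lengths(inverted))
```

A result consisting only of 1- and 2-cycles corresponds to the frequent permutations listed above, while a single 4-cycle corresponds to the rare case shown in the last example.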

The structure of f-repeats on the amino acid level manifests itself in the characteristic positioning and content of amino acids. Although the number of analyzed f-repeats was small, their clustering into several subtypes was observed. In particular, the regular sequence XYXZX, where X is lysine and Y and Z are arbitrary (often charged) amino acids (see example (ii) above), was encountered three times among the 77 f-repeats; the regularity X2YX occurred four times, but the amino acid X was different in different f-repeats; the regularities (XY)2 and XY2X occurred twice each, etc.

Analysis of isomorphic fragments in random (shuffled) copies of the phage λ demonstrated the following differences from the genome itself: (i) the number of tandem (palindrome type) f-repeats was significantly larger; (ii) the most frequent mapping was the identity (perfect repeats and symmetries, with the latter prevailing).

Summarizing, we note that the most informative characteristics are the ratio between the numbers of tandem and distanced f-repeats and the distribution of the permutation frequencies. A substantial deviation from uniformity is observed both in the use of particular permutations and in their cyclic structure (the most frequently used permutations contain only cycles of lengths 1 and 2). The composition of these permutations allows one to draw conclusions about the typical constructions occurring in texts and the character of evolutionary changes.

3. Non-standard possibilities of the complexity analysis

3.1. Prediction of signal-like constructions

Above we repeatedly emphasized the accumulating power of the complexity characteristics, meaning their ability to react to a large number of particular regularities. Precisely because of this universality, fragments of anomalous complexity can have multiple interpretations. In particular, they can be caused by the existence of a punctuation mark (regulatory site), by a change of sequence properties at the boundaries of structural segments, or by regularities on the protein level. In this section we consider the possibility of tuning to a specific type of regularity (naturally, at the cost of generality).

Assume that we wish to tune to the prediction of some particular functional sites (e.g. promoters), information about which is given by a learning sample $T_0 = T_1 * T_2 * \ldots * T_M$, where $T_m$ ($1 \le m \le M$) are promoters from natural genomes and $*$ is a delimiter. Let $T$ be the genome being analyzed, and let $D$ be the characteristic size of a promoter. The complexity profile of the text $T$ by the text $T_0$ is the sequence

$$P(T, T_0, D) = C_1(T, T_0),\ C_2(T, T_0),\ \ldots,\ C_{N-D+1}(T, T_0),$$

where $C_i(T, T_0)$ is the complexity of the fragment $T[i : i+D-1]$ decomposed by the text $T_0$ ($1 \le i \le N-D+1$, $N = |T|$). More exactly, the fragment $T[i : i+D-1]$ should be divided into blocks

$$T[i : i+j_1-1]\; T[i+j_1 : i+j_2-1] \ldots T[i+j_{k-1} : i+j_k-1] \ldots T[i+j_{n-1} : i+D-1],$$

such that the size of the $k$-th block ($1 \le k \le n$) is defined by the relation

$$j_k - j_{k-1} = \max_p \{\, l_p \mid T[i+j_{k-1} : i+j_k-1] = T_0[p : p+l_p-1] \,\}.$$
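The block decomposition defined above can be sketched in a few lines. This is a naive illustration under stated assumptions, not the authors' implementation: substring search stands in for whatever index they used, a symbol that does not occur in $T_0$ at all is counted as a block of length 1, and the function names are hypothetical. The delimiter `*` is assumed absent from the analyzed text, so no block can span two learning-sample elements.

```python
from typing import List


def longest_match(fragment: str, start: int, reference: str) -> int:
    """Length of the longest prefix of fragment[start:] that occurs in reference."""
    length = 0
    while (start + length < len(fragment)
           and fragment[start:start + length + 1] in reference):
        length += 1
    return length


def decomposition_complexity(fragment: str, reference: str) -> int:
    """Number of blocks when fragment is greedily copied from reference."""
    pos, blocks = 0, 0
    while pos < len(fragment):
        step = longest_match(fragment, pos, reference)
        pos += max(step, 1)  # assumption: an unseen symbol forms a block of length 1
        blocks += 1
    return blocks


def complexity_profile(text: str, reference: str, window: int) -> List[int]:
    """C_i for every window text[i : i + window] (0-based i)."""
    return [decomposition_complexity(text[i:i + window], reference)
            for i in range(len(text) - window + 1)]


# Hypothetical learning sample of two short "signals" joined by the delimiter.
learning_sample = "*".join(["TATAAT", "TTGACA"])
print(complexity_profile("GGTATAATGG", learning_sample, 6))
```

Windows that are copied from the learning sample in few blocks (low values in the profile) are the candidates for signal-like sites.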

based on a computer experiment with random copies of T; (iii) based on the bootstrap technique, in which we delete sequences from T0 one by one, decompose each of them by the remaining sequences and estimate the scatter of the parameter of interest. The use of (ii) and (iii) allows one to reach a reasonable compromise between errors of the first and the second type. The marked promoter-like sites are then subjected to a grammatical analysis which uses the relative order of informative l-grams, the distances between them, their distribution and frequencies in the learning sample, etc. Then the final decision "signal / non-signal" is made.

The main advantages of this approach are the following:

(1) It is not necessary to align the elements of the learning sample, a step which usually precedes the construction of a consensus or a weight matrix. Alignment is difficult to perform if a priori information about functionally similar positions (e.g. initiation or termination points of the corresponding genetic processes) or conserved functional cores is absent.

(2) It is not necessary to perform a taxonomy analysis of the learning sample, which is usually prompted by its non-uniformity. The use of taxonomy leads to a "multi-consensus" description of the learning sample, which partially decreases the information loss. However, in this case small taxons (represented by anomalous signals) are often purposefully ignored. By our observation, each learning sample of signals of known types contains up to 20% of such elements. The suggested method allows one to take anomalous signals into account according to their weight obtained in the bootstrap procedure.

(3) Various departures are possible from the direct deciphering scheme implemented in the above method, which requires a sufficiently representative learning sample. Among the resulting schemes of inductive generalization of the regularities present in the learning sample we note, in particular, the possibility of allowing a restricted number of mismatching symbols in the decomposition components.
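The delete-one-sequence procedure of item (iii) can be sketched as follows. This is an illustration only: the "parameter of interest" is assumed here to be the number of decomposition blocks, the greedy substring search is a naive stand-in for the authors' algorithm, and the names `block_count` and `leave_one_out_scatter` are hypothetical.

```python
from statistics import mean, pstdev
from typing import List, Tuple


def block_count(fragment: str, reference: str) -> int:
    """Number of blocks when fragment is greedily copied from reference."""
    pos = blocks = 0
    while pos < len(fragment):
        length = 0
        while (pos + length < len(fragment)
               and fragment[pos:pos + length + 1] in reference):
            length += 1
        pos += max(length, 1)  # assumption: an unseen symbol is its own block
        blocks += 1
    return blocks


def leave_one_out_scatter(samples: List[str],
                          delimiter: str = "*") -> Tuple[List[int], float, float]:
    """Decompose each sample by all the others; return counts, mean and spread."""
    counts = []
    for m, held_out in enumerate(samples):
        rest = delimiter.join(samples[:m] + samples[m + 1:])
        counts.append(block_count(held_out, rest))
    return counts, mean(counts), pstdev(counts)


# Two similar hypothetical signals and one anomalous element.
counts, average, spread = leave_one_out_scatter(["TATAAT", "TATAAA", "GGGCCC"])
print(counts, average, spread)
```

An element that needs many more blocks than the rest of the sample behaves like an anomalous signal in the sense of item (2), and the scatter of the counts gives a data-driven basis for the decision threshold.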

3.2. Compression of genetic texts

Ziv and Lempel (1976) introduced their complexity measure for the purposes of effective compression. Each component in the scheme of sequence formation is represented by a triple: the length $L$, the copying pointer $p$ and an additional symbol generated after each copying event (in our scheme (Gusev et al., this volume) the symbol is generated only if $p = 0$, i.e. when it occurs in the text for the first time). The parameter $p$ is encoded by $\lceil \log_2 N \rceil$ bits ($N$ is the text length). In order to encode $L$, it was suggested in (Ziv and Lempel, 1976) to restrict the maximum component of the decomposition, $L \le L_{\max}$ (which is equivalent to restricting the maximum length of repeats). Then the length of each component can be represented by $\lceil \log_2 L_{\max} \rceil$ bits. Since the expected length of the maximum repeat is of the order $\log N / \log |\Sigma|$, the main memory volume is occupied by the copying pointers.

We suggest developing the Lempel-Ziv technique in three directions. The first one is related to the use of the measure C2 instead of C1. Table 1 demonstrates the decrease in the number of encoded components in this case. The additional memory necessary to encode the type of the copying operation is 2 bits per decomposition component.

The second direction is related to the possibility of avoiding the restriction on the maximum repeat length, which is usually unknown. Moreover, the existence of very long repeats in some genomes can make it unnecessary to encode components of the maximum allowed length related to such a repeat. The technique of coding of exclusions, well known in coding theory, can be applied here in order to eliminate this restriction.

The third possible development is the preprocessing of very large genomes in order to detect the structure of repeats. Long texts are usually encoded by the sliding window technique, with the maximum possible window size D determined by the memory resources.
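The triple representation and the bit count discussed in this subsection can be made concrete with a short sketch. It is not the coder of Ziv and Lempel nor the authors' scheme: pointers are 0-based, every component carries an additional symbol, the search is a naive `str.find`, and charging $\lceil \log_2 |\Sigma| \rceil$ bits for the additional symbol is an assumption of this example. All function names are hypothetical.

```python
from math import ceil, log2
from typing import List, Tuple

Triple = Tuple[int, int, str]  # (copy length, 0-based pointer, additional symbol)


def lz_triples(text: str, l_max: int) -> List[Triple]:
    """Greedy decomposition into (length, pointer, symbol) components."""
    triples, pos = [], 0
    while pos < len(text):
        length, pointer = 0, 0
        # Extend the copy while a source starting before pos exists and one
        # position remains for the additional symbol.
        while length < l_max and pos + length + 1 < len(text):
            p = text.find(text[pos:pos + length + 1], 0, pos + length)
            if p == -1:
                break
            length, pointer = length + 1, p
        triples.append((length, pointer, text[pos + length]))
        pos += length + 1
    return triples


def decode(triples: List[Triple]) -> str:
    """Rebuild the text; copies may overlap the part being written."""
    out: List[str] = []
    for length, pointer, symbol in triples:
        for k in range(length):
            out.append(out[pointer + k])
        out.append(symbol)
    return "".join(out)


def encoded_bits(n_components: int, text_len: int, l_max: int, alphabet: int) -> int:
    """Bits for all components: pointer + length + additional symbol (assumed)."""
    per_component = ceil(log2(text_len)) + ceil(log2(l_max)) + ceil(log2(alphabet))
    return n_components * per_component


text = "ACGTACGTACGA"
components = lz_triples(text, l_max=8)
print(components, encoded_bits(len(components), len(text), 8, 4))
```

The pointer term $\lceil \log_2 N \rceil$ grows with the text length while the length term stays small, which matches the remark above that the copying pointers occupy most of the memory.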
The sliding window scheme is more convenient than breaking the text into non-overlapping fragments of length D and compressing each fragment independently. However, the sliding window method does not allow one to use long repeats whose copies never occur simultaneously in some window. The preprocessing based on the considerations of Section 2.2 can detect and employ such repeats even if the memory is restricted.

3.3. The search for clustered structures

A clustered structure is an f-repeat (see the definition of the measure C2 in (Gusev et al., this volume)) with a restriction on the distance between the fragments forming the repeat. Let d be the maximum allowed distance between the beginnings of the fragments forming a repeat, and let L0 be the lower threshold of their length. We are interested in an algorithm searching for repeat, symmetry and stem-loop structures satisfying the above restrictions. Note that the use of general algorithms in this case is unreasonable, due to high memory requirements and the necessity of deleting spurious repeats (those not satisfying the restrictions).

The algorithm for computation of the complexity profile by the measure C2 is a convenient tool for the solution of this problem. If the window size is D = d + L0, then the occurrence of decomposition components with length equal to or exceeding L0 indicates the existence of clustered structures satisfying the above conditions. Note that the fragment is not necessarily of anomalous complexity. In effect, a novel definition of anomaly is introduced: anomalous now are the fragments containing components of sufficiently large length (L ≥ L0).

A complicating factor is the possibility of masked long components, namely, components that can reach the prescribed length only after extension to the left (see Example 1 in (Gusev et al., this volume)). Thus fragments that formally contain no components of length L0 or more can in practice contain clustered structures of the given size. This rare situation can be simply avoided.
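The search problem stated in this subsection (pairs of fragments of length at least L0 whose beginnings are at most d apart) can be illustrated with a small sketch. Note that it swaps in a different technique: a dictionary of L0-long words rather than the C2 complexity-profile algorithm the authors propose, and it reports only seed pairs of length exactly L0 without extending them. The function name and the restriction to the alphabet ACGT are assumptions of the example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def clustered_pairs(text: str, d: int, l0: int) -> List[Tuple[int, int, str]]:
    """Return (i, j, kind) for word pairs of length l0 with 0 < j - i <= d."""
    seen: Dict[str, List[int]] = defaultdict(list)  # word -> earlier start positions
    hits = []
    for j in range(len(text) - l0 + 1):
        word = text[j:j + l0]
        partners = {
            "repeat": word,                                   # same word earlier
            "symmetry": word[::-1],                           # mirror image earlier
            "stem-loop": word[::-1].translate(COMPLEMENT),    # reverse complement earlier
        }
        for kind, key in partners.items():
            for i in seen.get(key, ()):
                if j - i <= d:
                    hits.append((i, j, kind))
        seen[word].append(j)
    return hits


print(clustered_pairs("GATTACACCCGATTACA", d=10, l0=7))
print(clustered_pairs("GGGAAATTTCCC", d=8, l0=4))
```

Unlike the window-based approach described above, this sketch stores all earlier word positions, so its memory grows with the text length; discarding positions older than d would bound it, at the cost of a slightly longer loop.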

Conclusion

We have suggested an easily extendable set of complexity characteristics of a genome and performed a comparative analysis of various genomes with regard to these characteristics. The characteristics possess properties important for classification purposes, namely: (1) their values are close for related species and do not depend on the genome length; (2) when the set of characteristics is extended, their concordant behaviour persists and the differences between genomes and their random analogs increase.

We have considered methods of discovering structural links between local fragments arbitrarily distanced along a genome. The first method reduces to the determination of fragments of minimum complexity, their description in terms of structural regularities and the choice of fragments with close structural characteristics. The creation of a bank of structural regularities of genomes and the search for structural homologies is a promising direction in the analysis of strongly diverged genomes. The second method allows one to discover in genomes large regularities similar to long terminal repeats and periodic structures. It has some computational advantages as compared to the traditional approaches. The third method is intended for discovering isomorphic segments, i.e. fragments coinciding up to a renaming of the alphabet symbols and, possibly, a change in the reading direction. Particular cases of isomorphic fragments are ordinary repeats, mirror and complementary palindromes, distanced symmetries and stem-loop constructions. Isomorphic fragments carry important information about the functional and evolutionary organization of texts.

Among the promising applications of the complexity approach are the possibility of tuning to the prediction of punctuation marks of a given type, the compression of genetic texts by the measures C1 and C2, and the fast search for structures with the distance between elements not exceeding a fixed threshold.


References

De Bruijn, N.G., 1946, A combinatorial problem. Proc. Kon. Ned. Akad. v. Wet. 49, 758-764.

Grassberger, P., 1989, Estimating the information content of symbol sequences and efficient codes. IEEE Trans. Inform. Theory IT-35, 669-675.

Gusev, V.D. and Chupakhina, O.M., 1991, Search and classification of text fragments with similar complexity structure, in: Analysis of Time Series and Symbol Sequences (Computer systems, vol. 141) (Novosibirsk), pp. 25-45.

Gusev, V.D., Kulichkov, V.A. and Chupakhina, O.M., The Lempel-Ziv complexity and local structure analysis of genomes. (This volume).

Jeffrey, H.J., 1990, Chaos game representation of gene structure. Nucl. Acids Res. 18, 2163-2170.

Ziv, J. and Lempel, A., 1976, A universal algorithm for sequential data compression. IEEE Trans. Inform. Theory IT-23, 75-81.