Physica A 341 (2004) 471 – 481
www.elsevier.com/locate/physa
Inference of phylogenetic distances from DNA-walk divergences P. Licinioa;∗ , R.B. Caligiorneb a Departamento
b Centro
de F sica, ICEx, UFMG - C.P.702 - 30.123.970 Belo Horizonte, MG, Brazil de Pesquisas Ren e Rachou, FIOCRUZ - Av. Augusto de Lima, 1715 - 30.190.002 Belo Horizonte, MG, Brazil Received 3 March 2004; received in revised form 25 March 2004
Abstract A formalism for the analysis of DNA-sequences is presented. It develops the concept of a composition vector potential which incorporates intrinsic double-strand and four-base symmetries. The vector potential allows for straightforward coarse graining and graphical representation of whole genomes. Its projections are mapped onto DNA-walks. It is shown that distances due to mutation between sequences can be estimated from mean square di4erences between walks. A computer program for global alignment of sequences (DNAWD) is thus developed and applied to a set of toy sequences representing evolution under mutation. The distance matrix output of DNAWD is shown to provide a good estimate of the associated phylogenetic tree. c 2004 Elsevier B.V. All rights reserved. PACS: 87.10.+e; 87.23.Kg; 89.75.Hc Keywords: DNA-walk; Phylogenetics; Global alignment; Composition potential
1. Introduction Physicists introduced DNA-walks as a tool for the statistical analysis of freshly sequenced, long DNA segments in the last decade [1]. The original interest and much of the subsequent debate regarding this technique centered on the nature and origin of both long- and short-range correlations found in these walks [2–4]. Further analysis of walk structure revealed clear and meaningful correlation to biological expression, as shown for example by walk–drift inversion at bacterial genome ORI and TER regions, ∗
Corresponding author. Fax: +55-31-3499-5600. E-mail address:
[email protected] (P. Licinio).
c 2004 Elsevier B.V. All rights reserved. 0378-4371/$ - see front matter doi:10.1016/j.physa.2004.03.098
472
P. Licinio, R.B. Caligiorne / Physica A 341 (2004) 471 – 481
associated to di4erential bias pressure at leading and lagging strands during replication [5,6]. Three phase walks were introduced by Cebrat and coworkers in order to incorporate and distinguish amino acid coding periodicity into the walking scheme [7]. Indeed di4erent drift-trends are found for each phase. A partial explanation for this phenomena can be found in the codon usage frequency, particularly in the high third-base degeneracy, which frees this base composition from protein Etness constraints. The intrinsic connection between DNA-walk geometry and sequence composition makes this kind of representation a powerful tool for sequence analysis. Di4erent conversion schemes and formalisms have been adopted, such as four-dimensional [8], pyramidal [9], tetrahedral [10,11] bi-dimensional [12] and linear [13] walk representations. Walks have been employed as a basic tool for the analysis of long-range correlations [1], detection of coding–noncoding regions [14], statistical modeling of genome structure [15] or study of individual genes [16]. Although physicists are historically given to geometrical representation of variables, the bioinformatics community has most uniformly adopted the 4-letter representation of nucleic base sequences. The latter form, although still appropriate for electronic storage and algorithmic analysis, is clearly cumbersome to display except for very small segments. It is the purpose of this paper to formalize a useful framework connecting and stressing the equivalence in these representations. The approach provides a physico-chemical interpretation for a generalized tetrahedral walk representation, by deEning a composition vector potential, which links DNA-sequence composition to walk geometry. The main advantages of the walk representation may then be realized, such as the ease of application of renormalization procedures and other theoretical treatments inspired from physics. As an example of a new use for this tool in phylogenetic analysis, we describe how DNA-walk di4erences can be used to compare homologous sequences and measure their mutual evolutionary distance resulting from mutation. This paper is organized as follows. First the concept of a composition vector potential which maps into DNA-walks is developed. In the following section, the proper metric of the vector potential or DNA-walk space is taken for sequence comparisons. The e4ect of mutation on walk geometry is analyzed and a statistical mutation distance measure is deEned. Finally this scheme is applied to the comparative analysis of homologous sequences. Toy sequences are submitted to successive duplication and mutation events and then compared with the aid of a computer program [DNAWD] 1 that gives the statistical mutation distance between sequence pairs. 2. Vectorial composition analysis SpeciEc or relative composition of a quaternary system such as DNA has three degrees of freedom and can thus be fully speciEed as a point inside the volume of a tetrahedron. In this spatially isotropic representation, each of the four tetrahedron 1
DNAWD stands for DNA-walk divergence analysis program. It is a routine for the pairwise comparison (producing the best alignment and corresponding 2 following Section 3 of this paper) of multiple DNA sequences from either GenBank or user text Ele input. It is freely available upon request to the authors.
P. Licinio, R.B. Caligiorne / Physica A 341 (2004) 471 – 481
473
Fig. 1. Tetrahedral DNA base-director set ˜a;˜t;˜c; ˜g for DNA-walk construction. The orthonormal set ˜x; ˜y;˜z is shown. The 45◦ sub-set ˜x ; ˜y deEnes the symmetry projection plane.
√ directors of modulus 3 pointing to a vertex ˜a, ˜t, ˜c, or ˜g is assigned to one of the corresponding composition species A, T, C, G (Fig. 1). Composition can thus be represented by the vector ˜ = A˜a + T˜t + C˜c + G˜g ;
(1)
where A = NA =N is the concentration of species A in an ensemble of N molecules and so A + T + C + G = 1 :
(2)
Concentrations can be obtained from projections of the composition vector and so be calculated from trigonometric relations as for example ˜ · ˜a = 4A − 1 ;
(3)
1 ˜t · ˜a = ˜c · ˜a = ˜g · ˜a = − ˜a · ˜a = −1 : 3
(4)
since
It can also be shown that orthogonal projections into the tetrahedron faces and edges reJect the relative compositions of the corresponding ternary and binary sub-systems. Since the tetrahedron belongs to a cubic system (its edges can be inscribed into the face diagonals of a cube), it is useful to deEne a symmetric orthonormal set of axes ˜x, ˜y, ˜z (directed towards the centers of the tetrahedron edges—Fig. 1) as ˜x =
1 (˜a − ˜t − ˜c + ˜g) ; 4
474
P. Licinio, R.B. Caligiorne / Physica A 341 (2004) 471 – 481
˜y =
1 (˜a − ˜t + ˜c − ˜g) ; 4
1 (˜a + ˜t − ˜c − ˜g) : 4 Inverse relations and projections follow similar to ˜z =
(5)
˜a = ˜x + ˜y + ˜z
(6)
˜x · ˜a = ˜x · ˜g = −˜x · ˜t = −˜x · ˜c = ˜x · ˜x = 1 :
(7)
and The z-axis is of particular signiEcance for genome analysis due to DNA strand complementarity, as will be discussed below. Projections onto this direction represent the excess concentration of weak over strong binding bases Z ≡ ˜ · ˜z = A + T − C − G = (1 − RC )=(1 + RC ) ;
(8)
where RC = (C + G )=(A + T ) is the Charga4 ratio, a species-speciEc composition characteristic value. Equivalently, the x-projection measures the excess concentration of pyrimidines over purines. Of course, from the orthonormal projections one also obtains any other concentration by linear combinations such as 1 A = (X + Y + Z + 1) ; (9) 4 etc. In order to proceed we stipulate that a single nucleotide is represented as one of the four original directors ˜(i) = [˜a;˜t;˜c;˜g]. A sequence of nucleotides can be represented by a chaining (ordered sum) of the respective vectors, deEning a DNA-walk or a nucleotide composition vector potential function at base n as n ˜ ˜(i) : (n) = (10) i=1
The composition potential carries all the information of the DNA sequence, in a form suitable for computer analysis or graphical display. Its integral form allows a straightforward coarse graining for scaling analysis up to the size of the whole genome. The (mean) composition within an interval around n is given as n+=2 ˜ + =2) − (n ˜ − =2)] [(n 1 ˜(n; ) = dt˜(t) ; (11) = n−=2 while ˜(t) =
n i=1
˜(i)(t − i) =
˜ d (t) ; dt
(12)
where we have introduced the integro-di4erential notation with the help of the Kronecker delta function (t − i), localizing the nucleotides at the integer values of the sequence. Under this scheme, the DNA-walk assumes a discontinuous aspect as a
P. Licinio, R.B. Caligiorne / Physica A 341 (2004) 471 – 481
475
Fig. 2. Symmetry operations relating a DNA strand to its complementary strand as represented by DNA-walk projections. A global reJection of the walk on the x–y plane (x → x; y → y; z → −z; n → −n) is also given as: (a) A -rotation in z–n plane; (b,c) A mirror reJection in x or y against n or z plane; and (d) A unitary transform (with anti-sense progression) in the x–y plane.
polymer with phantom segments oriented in one of the four possible directions in a three-dimensional space. ˜ Once the composition potential (n) is represented as a walk in three dimensions, the complementary strand potential can be obtained with the symmetry operation of a -rotation around the z-axis for each base followed by a corresponding anti-sense reading; or equivalently, through a global reJection of the walk on the x–y plane (x → x; y → y; z → −z; n → −n) (see Fig. 2). Note that the redeEnition of the origin becomes arbitrary since only potential di4erences have a physical meaning. As a consequence of these symmetry properties, one sees that the shuMing of bases from a double strand, with originally unbalanced composition, cannot alter the net z-walk drift (Exed Charga4 ratio), while it will statistically produce the second parity rule (intra-strand complementary base quasi-equilibrium). The x–y plane orthogonal to the symmetry axis can also be decomposed into the simpler x − y set ˜x = (˜x + ˜y) = 12 (˜a − ˜t) ; ˜y = (˜y − ˜x) = 12 (˜c − ˜g) :
(13)
476
P. Licinio, R.B. Caligiorne / Physica A 341 (2004) 471 – 481
Composition or potential projections along these axes express the single strand partial skew (unequal concentration of complementary bases: excess of adenines over thymines and of cytosines over guanines) as X ≡ ˜ · ˜x = A − T ;
(14)
while ˜x · ˜a = −˜x · ˜t = ˜x · ˜x = 1 ; ˜x · ˜c = ˜x · ˜g = 0
(15)
˜x · ˜x = 2 :
(16)
and Finally, we wish to point that due to the codon (triplet) nature of genome information, walks within three distinct phases have proven meaningful. A codon allows for 43 =64 di4erent codes, although only 21 outputs result from the translation degeneracy. These codes can also be given as the speciEcation of one of 64 discrete unitary vector directions in a nine-dimensional space combining di4erent compositions into three ordered conjugate phases. Nevertheless, simpler schemes can be deEned. The three phases can be drawn independently (indexed) onto the same composition space, generating three walks; or a single composition projection for each phase could be chosen to deEne an orthogonal set to display a three dimensional projection of the complete phase structure. Two-dimensional projections of the Erst kind have been studied by the group of Cebrat and Dudek [7] and called 3-legged DNA-spiders. Within exons, each leg-phase displays a characteristic and individual composition trend. 3. Distance metric: composition evolution and walk divergence Each DNA sequence corresponds to a unique three-dimensional DNA-walk or vector ˜ potential function (n). On the other hand, a single site mutation (substitution) at i will uniformly displace the walk downstream from the mutation (n ¿ i) along one of the tetrahedron edges by N˜(i). A series of mutations thus corresponds to di4usion in walk space. Insertion and deletion mutations (indels) produce similar walk drifts. It then becomes feasible to correlate walk di4erences to mutation events. The most straightforward way to compare sequences is to use the intrinsic metric of DNA-walk (composition potential) space for distance measurements. It makes sense to Et sequence A to sequence B by translating one against the other until a best match is found. We call this process DNA-walk divergence analysis. We have implemented this analysis and made it available in the form of a computer program DNAWD (see footnote 1) 2 to calculate the minimum variance AB between sequence pairs relative to a variable site displacement n : 2 N N 1 1 ˜ 2 2 ˜ AB = |N(n; n )| − N(n; n ) ; (17) N N n=1
n=1
P. Licinio, R.B. Caligiorne / Physica A 341 (2004) 471 – 481
477
Fig. 3. (a) AT–CG walk (z-potential) of rDNA ITS regions for three dematiaceous fungal strains (black yeasts). The ITS1 and ITS2 are non-coding, highly variable regions. Similar species (Phialophora americana—thin continuous line and Phialophora verrucosa—dotted line) display similar walks. Distant species (Exophiala dermatitidis—thick line) display a distinct walk. (b) Sequence pair walk square distance 2 as a function of relative sequence displacement. The minimum of the function deEnes a best global alignment. Codes in parenthesis indicate GenBank accession numbers.
where ˜ A (n) − ˜ B (n + n ) ˜ N(n; n ) =
(18)
and the normalized sums run over the interval where the sequences superpose each other. The mininum variance corresponds to a gap-less global alignment of the sequences. Fig. 3a displays z-axis walk projections for 3 sequences globally aligned following this procedure. The sequences are from rDNA ITS regions for three dematiaceous fun2 for 2 pairs of these sequences gal strains [17]. The mean square walk variance AB is also represented in Fig. 3b as a function of the displacement n . Closely related species indeed display more similar walks and lower variance. Since the variance measures sequence pair di4erences (and aOnities) their relation to mutation history will be investigated next.
478
P. Licinio, R.B. Caligiorne / Physica A 341 (2004) 471 – 481
In the case of a single substitution, 2 of the 3 orthogonal directions will be a4ected. For example, A → T will change x and y-coordinates but not z, which only discriminates strong from weak binding bases. Since each coordinate changes obligatorily by a unit step at each position, the 2 changing coordinates will move uniformly up or down by 2 units downstream from the changing site n, i.e., to a mutation at site n in sequence A, producing the derived sequence A will correspond the (minimum) variance (N − n)2 8n(N − n) N −n 2 −4 AA = ; (19) (n) = 2 4 2 N N N2 with a mean value estimated over all mutation sites of 2 AA =
4 : 3
(20)
The insertion of a single base at position n, in a sequence of N bases will move the walk downstream the insertion point along the corresponding tetrahedron director (˜a;˜t;˜c;˜g). A moment’s thought will show that once ordinate change by one unit, null or either ±2 mismatches will occur at random for each axis (depending on sequence statistics). The random expectation for n ¿ N=2 is 3n(2N − n) N −n (N − n)2 2 AA = ; (21) − (n) = 3 2 2 N N2 N with a mean value estimated over the second half of mutation sites of 5 2 AA : = 4
(22)
The half sequence mismatch expectation (n ¿ N=2) was taken since an insertion in the Erst-half would lead to a symmetrically better Et from the superposition of the sequences’ second halves. 4. Application: phylogenetic analysis of toy sequences Consecutive mutation events will produce a walk dynamics similar to di4usion. The minimum composition potential variances become estimators of the number of mutations. For ns random substitutions and ni independent indels of mean size m, in the linear or low mutation density regime the expectation becomes: 2 ≈
4 5 ns + m · ni ≈ 1:3n ; 3 4
(23)
where n = ns + m · ni is the number of a4ected sites. The mutation distance n between two sequences can be estimated from their lowest mutual Et 2 =1:3. Under good statistical conditions (sequence size and density of mutations) the estimated mutation distance can be used as input for phylogenetic tree reconstruction. In order to test this approach, we construct and compare 16 sequences of 600 bases. The sequences were created by successive duplications for 4 generations. Each given pair of alleles is made to evolve independently with a base substitution probability of 1% and a random
P. Licinio, R.B. Caligiorne / Physica A 341 (2004) 471 – 481
479
Fig. 4. Phylogenetic tree for toy sequences submitted to mutation for 4 generations (see text for details). The tree to the left is estimated from DNA-walk analysis. The tree to the right depicts the ideal or theoretical tree. Too low or too high noise (mutation) in the data intrinsically limits the precision in tree reconstruction. The clades marked ( ) are correctly identiEed, the ones marked ( ) are not.
•
insertion of up to 10 consecutive bases with 0.5% probability. Realistic sequence size and noise were taken, typical of ribosomal hypervariate regions (ITS-rDNA) of black yeasts [17]. Sequences of this kind are known for being diOcult to align and to produce poor phylogenetic reconstruction. The 16 toy sequences were compared pairwise by the DNAWD program. Fig. 4 shows a tree calculated from the distance matrix output of DNAWD, using Phylip’s phylogenetic analysis program KITSCH. 2 From the 14 original clades, 10 were correctly identiEed which is a fair guess. There is no reason to expect a 100% correspondence from any method since there is only hope that noise (mutations) will statistically mark independent duplications well. Instead, a theory for the conEdence in estimates should be developed. The precision in distance estimates could be modeled for random noise and then transformed into a clade conEdence factor once a tree is proposed. Note that the bootstrap method used to access conEdence in local alignments does not apply to the global alignment described here.
2 KITSCH is a routine computing trees from distance matrixes using molecular clock hypothesis (constant mutation rates). It is included as part of the Phylip (v.3.5c) phylogenetics analysis package [18].
480
P. Licinio, R.B. Caligiorne / Physica A 341 (2004) 471 – 481
5. Conclusions The concept of a composition vector potential is developed here. It maps as projections into many of the proposed DNA-walk graphic displays. The integral nature of the potential allows for ease of coarse graining. Base pairing symmetries are readily retrieved in the potential and codon structure can be implemented as well. Sequences can be compared by a global alignment of their potentials. We call this technique a DNA-walk divergence analysis and show that it gives an estimate for mutation distances between sequences. Distances can be used for the inference of phylogenetic trees and this approach has been tested for the analysis of toy sequences. The estimated tree reproduces much of the ideal tree structure. Since the toy-sequence construction involves noise (with a Juctuating number, size and quality of mutations), the reconstruction is intrinsically inexact by any method. This observation parallels the formulation of the second law of thermodynamics or statistical physics on information propagating systems. The quest for ideal phylogenetic tools is still an open problem to which the DNA-walk or composition potential formalism sketched here could be of aid. The method for sequence comparison proposed and available through the DNAWD computer program is seen to already extract meaningful phylogenetic information from noisy data. An application to fungal pathogens is presented elsewhere [17]. Acknowledgements We would like to thank Ron Dickman for a careful revision of this presentation. This work was supported by the Brazilian agencies CNPq and FAPEMIG. References [1] C.-K. Peng, S.V. Buldyrev, A.L. Goldberger, S. Havlin, F. Sciortino, M. Simons, H.E. Stanley, Long-range correlations in nucleotide sequences, Nature 356 (1992) 168–170. [2] S.V. Buldyrev, N.V. Dokholyan, A.L. Goldberger, S. Havlin, C.-K. Peng, H.E. Stanley, G.M. Viswanathan, Analysis of DNA sequences using methods of statistical physics, Physica A 249 (1998) 430–438. [3] H. Herzel, E.N. Trifonov, O. Weiss, I. GroRe, Interpreting correlations in Biosequences, Physica A 249 (1998) 449–459. [4] B. Audit, C. Thermes, C. Vaillant, Y. d’Aubenton-Carafa, J.F. Muzy, A. Arneodo, Long-range correlations in genomic DNA: a signature of the nucleosomal structure, Phys. Rev. Lett. 86 (11) (2001) 2471–2474. [5] J.R. Lobry, A simple vectorial representation of DNA sequences for the detection of replication origins in Bacteria, Biochimie 78 (1996) 323–326. [6] M. Picardeau, J.L.R. Lobry, B.J. Hinnebusch, Analysing DNA strand compositional asymmetry, to identify candidate replication origins of borrelia burgdorferi linear and circular plasmids, Genome Research 10 (2000) 1594–1604. [7] S. Cebrat, M.R. Dudek, The e4ect of DNA phase structure on DNA walks, Eur. Phys. J. B. 3 (1998) 271–276. [8] C.L. Berthesen, J.A. Glazier, M.H. Skolnick, Global fractal dimension of human DNA sequences treated as pseudorandom walks, Phys. Rev. A 45 (12) (1992) 8902–8913.
P. Licinio, R.B. Caligiorne / Physica A 341 (2004) 471 – 481
481
[9] E. Hamori, J. Ruskin, H curves, a novel method of representation of nucleotide series especially suited for long DNA sequences, J. Biol. Chem. 258 (2) (1983) 1318–1327. [10] C.-T. Zang, A symmetrical theory of DNA sequences and its applications, J. Theor. Biol. 187 (1997) 296–306. [11] B. Silverman, R. Linsker, J. Theor. Biol. 118 (1986) 295. [12] E. Mizraji, J. Ninio, Graphical coding of nucleic acid sequences, Biochimie 67 (1985) 445–448. [13] Zu-Guo Yu, Bin Wang, A time series model of CDS sequences in complete genome, Chaos, Solitons and Fractals 12 (2001) 519–526. [14] H.E. Stanley, S.V. Buldyrev, A.L. Goldberger, S. Havlin, C.-K. Peng, M. Simons, Scaling features of noncoding DNA, Physica A 273 (1999) 1–18. [15] S.V. Buldyrev, A.L. Goldberger, S. Havlin, C.-K. Peng, M. Simons, H.E. Stanley, Generalized LVevy-walk model for DNA nucleotide sequences, Phys. Rev. E 47 (6) (1993) 4514–4523. [16] P. Mackiewicz, A. Gierlik, M. Kowalczuk, D. Szczepanik, M.R. Dudek, S. Cebrat, Mechanisms Generating Long Range Correlation in Nucleotide Composition of the Borrelia Burgdoferi Genome, Physica A 265 (1999) 78–84. [17] R.B. Caligiorne, P. Licinio, J. Dupont, G.S. de Hoog, ITS-rDNA-based phylogenetic reconstruction in black yeasts and their relatives using algorithms with local and global sequence alignment, (2004) submitted for publication. [18] J. Felsenstein, PHYLIP (Phylogeny Inference Package) version 3.5c, Distributed by the author, Department of Genetics, University of Washington, Seattle, 1993.