Chemical Physics Letters 379 (2003) 412–417 www.elsevier.com/locate/cplett
New 3D graphical representation of DNA sequences and their numerical characterization
Chunxin Yuan *, Bo Liao 1, Tian-ming Wang
Department of Applied Mathematics, Dalian University of Technology, Dalian 116024, China
Received 16 June 2003; in final form 29 July 2003
Published online: 17 September 2003
Abstract

We consider a 3D graphical representation of DNA sequences and its numerical characterization. The representation avoids the loss of information that accompanies alternative 2D and 3D representations in which the curve standing for DNA overlaps and intersects itself. The method is illustrated on the coding sequence of the first exon of the human β-globin gene. © 2003 Elsevier B.V. All rights reserved.
1. Introduction

The advantage of graphical representations of DNA sequences [1–5,7,8,10] is that they allow visual inspection of data, which helps in recognizing major differences among similar DNA sequences [6,9]. Nandy [3] presented a graphical representation that assigns A (adenine), G (guanine), T (thymine), and C (cytosine) to the four directions $(-x)$, $(+x)$, $(-y)$, and $(+y)$, respectively. Such a representation of DNA is accompanied by: (1) some loss of visual information associated with crossing and overlapping of the resulting curve by itself; (2) an arbitrary decision with respect to the choice of the direction for the four bases. Randic [2] presented a 3D graphical representation, but the limitation associated with crossing and overlapping of the spatial curve representing a DNA sequence remains. Recently, Randic [1] presented a novel 2D graphical representation which avoids the limitation of Nandy's approach. Hamori [10] presented the H-curve, a 3D graphical representation of DNA sequences in which the four bases are represented by four directions (NW, NE, SE, and SW). The basic rule for constructing the H-curve is to move, for each base, one unit in the corresponding direction and one unit in the z-direction. The H-curve can uniquely represent a DNA sequence, but it requires 2D projections or 3D stereo projections of DNA sequences.

In this Letter we introduce a new 3D graphical representation of DNA primary sequences, in which there is also no loss of information in the transfer of data from a DNA sequence to its mathematical representation. Our approach is different from the H-curve. The graphical representation we introduce is simple and direct, requires no 2D projection or 3D stereo projections, and also represents a DNA sequence uniquely.

* Corresponding author. E-mail addresses: [email protected] (C. Yuan), [email protected] (B. Liao), [email protected] (T. Wang).
1 Fax: +86-411-4706100.
0009-2614/$ - see front matter © 2003 Elsevier B.V. All rights reserved. doi:10.1016/j.cplett.2003.07.023
2. Outline of the 3D graphical representation of DNA sequences

We assign each nucleic base to a vector as follows:

$$(-1, 0, 0) \to A, \quad (1, 0, 0) \to G, \quad (0, -1, 0) \to T, \quad (0, 1, 0) \to C.$$

That is to say, we assign A (adenine), G (guanine), T (thymine), and C (cytosine) to $-x$, $+x$, $-y$, and $+y$, respectively, while the corresponding curve extends along the z-axis. In detail, let $G = g_1 g_2 \ldots$ be an arbitrary DNA primary sequence. Then we have a map $\phi$, which maps $G$ into a plot set. Explicitly, $\phi(G) = \phi(g_1)\phi(g_2)\ldots$, where

$$\phi(g_i) = \begin{cases} (-1, 0, i) & \text{if } g_i = A, \\ (1, 0, i) & \text{if } g_i = G, \\ (0, -1, i) & \text{if } g_i = T, \\ (0, 1, i) & \text{if } g_i = C. \end{cases}$$

For example, the plot set corresponding to the sequence ATGGTGCACC is

$$\{(-1, 0, 1), (0, -1, 2), (1, 0, 3), (1, 0, 4), (0, -1, 5), (1, 0, 6), (0, 1, 7), (-1, 0, 8), (0, 1, 9), (0, 1, 10)\}.$$

We call this plot set the characteristic plot set. The curve connecting all points of the characteristic plot set in turn is called the characteristic curve. The first point of the characteristic curve is $(-1, 0, 1)$, which belongs to A and is reached directly from the origin. From that point we move in the direction assigned to T and arrive at the point $(0, -1, 2)$, the location of T. From here we move in the direction assigned to G and arrive at $(1, 0, 3)$, the location of the first G. Continuing in the direction of G, we come to the point $(1, 0, 4)$. Continuation of this process is illustrated in Table 1 for the 10 initial nucleic bases of the first exon. In Fig. 1, we show the 3D graphical representation of the sequence ATGGTGCACC.

Table 1
Cartesian 3D coordinates for the sequence ATGGTGCACC of the coding sequence of the first exon of the human β-globin gene

Base  Nucleotide  x    y    z
1     A           -1   0    1
2     T           0    -1   2
3     G           1    0    3
4     G           1    0    4
5     T           0    -1   5
6     G           1    0    6
7     C           0    1    7
8     A           -1   0    8
9     C           0    1    9
10    C           0    1    10
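The map described in this section can be sketched in a few lines of Python. This is only an illustration: the names `BASE_TO_XY` and `characteristic_plot_set` are not from the Letter, and the z coordinate is assumed to be the 1-based position of the base, as in Table 1.

```python
# Illustrative sketch of the map phi; all names here are assumptions.
BASE_TO_XY = {
    "A": (-1, 0),  # -x semi-axis
    "G": (1, 0),   # +x semi-axis
    "T": (0, -1),  # -y semi-axis
    "C": (0, 1),   # +y semi-axis
}


def characteristic_plot_set(sequence):
    """Return the points (x, y, i) of the characteristic curve, with i starting at 1."""
    points = []
    for i, base in enumerate(sequence.upper(), start=1):
        x, y = BASE_TO_XY[base]  # raises KeyError for symbols other than A, G, T, C
        points.append((x, y, i))
    return points


for point in characteristic_plot_set("ATGGTGCACC"):
    print(point)
```

Because z grows by one with every base, each base gets its own point and the sequence can be read back from the point list, which is why the curve neither overlaps nor intersects itself.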
3. Numerical characterization of DNA sequences

In order to find some of the invariants sensitive to the form of the characteristic curve, we transform the graphical representation of the characteristic curve into another mathematical object, a matrix. Once we have a matrix representing a DNA sequence, we can use some of the matrix invariants as descriptors of the sequence. We associate with the characteristic curve the matrices E, ⌊E⌋, M/M, and L/L. The ⌊E⌋ matrix is a symmetric matrix whose (i, j) element is defined as the integer part of the Euclidean distance between vertices i and j of the zig-zag curve. The M/M, L/L, ${}^{k}L/{}^{k}L$, and ${}^{b}L/{}^{b}L$ matrices are defined as in Randic's work [1]. The upper triangles of the E, M/M, and L/L matrices for the same segment of the DNA sequence are given in Tables 2–4.

In Fig. 1, we assigned A, G, T, and C to $-x$, $+x$, $-y$, and $+y$, respectively, and thus obtained a characteristic curve. If we assign A, T, G, and C to $-x$, $+x$, $-y$, and $+y$, respectively, we obtain another characteristic curve that also represents the considered DNA sequence, although the two curves are based on different orders of the labels on the four semi-axes. The labels A, T, G, and C can be arranged in 24
Fig. 1. Characteristic curve of the sequence ATGGTGCACC. The dots denote the bases making up the sequence.
Table 2
The upper triangle of the E matrix of the sequence ATGGTGCACC

Base  A      T      G      G      T      G      C      A      C      C
A     0      1.732  2.828  3.606  4.243  5.385  6.164  7      8.124  9.11
T            0      1.732  2.449  3      4.243  5.385  6.164  7.28   8.246
G                   0      1      2.449  3      4.243  5.385  6.164  7.141
G                          0      1.732  2      3.317  4.472  5.196  6.164
T                                 0      1.732  2.828  3.317  4.472  5.385
G                                        0      1.732  2.828  3.317  4.243
C                                               0      1.732  2      3
A                                                      0      1.732  2.449
C                                                             0      1
C                                                                    0
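The entries of Table 2 are consistent with plain Euclidean distances between the points of the characteristic curve. The sketch below recomputes E and its integer part ⌊E⌋ under that reading; the helper names (`points_of`, `euclidean_matrix`, `floor_matrix`) are illustrative and not taken from the Letter.

```python
import math

# Assumed coordinates, following the assignment A -> -x, G -> +x, T -> -y, C -> +y.
BASE_TO_XY = {"A": (-1, 0), "G": (1, 0), "T": (0, -1), "C": (0, 1)}


def points_of(sequence):
    """Points (x, y, i) of the characteristic curve, i = 1, 2, ..."""
    return [BASE_TO_XY[b] + (i,) for i, b in enumerate(sequence, start=1)]


def euclidean_matrix(sequence):
    """Symmetric matrix E of pairwise Euclidean distances between curve vertices."""
    pts = points_of(sequence)
    return [[math.dist(p, q) for q in pts] for p in pts]


def floor_matrix(matrix):
    """Integer part of every entry, i.e. the matrix written as floor(E) in the text."""
    return [[math.floor(v) for v in row] for row in matrix]


E = euclidean_matrix("ATGGTGCACC")
for i, row in enumerate(E):
    # Print only the upper triangle, rounded to three decimals.
    print("       " * i + " ".join(f"{v:6.3f}" for v in row[i:]))
```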
Table 3
The upper triangle of the M/M matrix of the sequence ATGGTGCACC

Base  A      T      G      G      T      G      C      A      C      C
A     0      1.732  1.414  1.202  1.061  1.077  1.027  1      1.016  1.012
T            0      1.732  1.225  1      1.061  1.077  1.027  1.04   1.031
G                   0      1      1.225  1      1.061  1.077  1.027  1.02
G                          0      1.732  1      1.106  1.118  1.039  1.027
T                                 0      1.732  1.414  1.106  1.118  1.077
G                                        0      1.732  1.414  1.106  1.061
C                                               0      1.732  1      1
A                                                      0      1.732  1.225
C                                                             0      1
C                                                                    0
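The Letter defers the definition of M/M to Ref. [1]. Comparing Tables 2 and 3 suggests that each entry is the Euclidean distance divided by the number of edges |i - j| separating the two vertices along the curve (for example 2.828/2 = 1.414). The sketch below assumes that reading; `mm_matrix` is a hypothetical name.

```python
import math

BASE_TO_XY = {"A": (-1, 0), "G": (1, 0), "T": (0, -1), "C": (0, 1)}


def mm_matrix(sequence):
    """Assumed M/M matrix: Euclidean distance divided by the number of edges |i - j|."""
    pts = [BASE_TO_XY[b] + (i,) for i, b in enumerate(sequence, start=1)]
    n = len(pts)
    mm = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            value = math.dist(pts[i], pts[j]) / (j - i)
            mm[i][j] = mm[j][i] = value  # the matrix is symmetric
    return mm


MM = mm_matrix("ATGGTGCACC")
print([round(v, 3) for v in MM[0]])  # first row, to compare with Table 3
```

Since consecutive points differ by exactly one unit in z, every off-diagonal entry under this definition is at least 1.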
Table 4
The upper triangle of the L/L matrix of the sequence ATGGTGCACC

Base  A       T       G       G       T       G       C       A       C       C
A     0       1       0.8165  0.8077  0.6847  0.6792  0.6381  0.6145  0.619   0.645
T             0       1       0.8966  0.672   0.6847  0.6792  0.6381  0.639   0.6654
G                     0       1       0.8966  0.672   0.6847  0.6792  0.6381  0.6699
G                             0       1       0.5774  0.6383  0.6455  0.6     0.6381
T                                     0       1       0.8165  0.6383  0.6455  0.6792
G                                             0       1       0.8165  0.6383  0.6847
C                                                     0       1       0.5774  0.672
A                                                             0       1       0.8966
C                                                                     0       1
C                                                                             0
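The values in Table 4 agree with the quotient of the straight-line distance between two vertices and the length of the curve between them (for example 2.828/(1.732 + 1.732) = 0.8165). The following sketch assumes this is the L/L definition of Ref. [1]; `ll_matrix` is an illustrative name.

```python
import math

BASE_TO_XY = {"A": (-1, 0), "G": (1, 0), "T": (0, -1), "C": (0, 1)}


def ll_matrix(sequence):
    """Assumed L/L matrix: straight-line distance divided by path length along the curve."""
    pts = [BASE_TO_XY[b] + (i,) for i, b in enumerate(sequence, start=1)]
    n = len(pts)
    # cumulative[k] = length of the curve from the first vertex to vertex k
    cumulative = [0.0]
    for k in range(1, n):
        cumulative.append(cumulative[-1] + math.dist(pts[k - 1], pts[k]))
    ll = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            path_length = cumulative[j] - cumulative[i]
            ll[i][j] = ll[j][i] = math.dist(pts[i], pts[j]) / path_length
    return ll


LL = ll_matrix("ATGGTGCACC")
print([round(v, 4) for v in LL[0]])  # first row, to compare with Table 4
```

Under this definition adjacent vertices always give 1, and every other entry is at most 1 because a straight line is never longer than the path along the curve.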
ways. This does not mean, however, that there are 24 different representations of one DNA sequence. In fact, there are only three matrix representations of one DNA sequence. A characteristic curve can be rotated, and a rotation of the characteristic curve is equivalent to a change of the order of the alphabet {A, T, G, C}. Since a rotation does not change the shape of the characteristic curve, the matrix representations based on the orders AGTC, GTCA, CAGT, and TCAG do not differ. Furthermore, if we exchange the positions of G and C, we get the order ACTG. Because this kind of exchange does not influence the distance between any two vertices of the characteristic curve, the matrix representation based on the order AGTC is the same as the one based on the order ACTG. Rotation and exchange are illustrated in Fig. 2. For these two reasons, the eight orders AGTC, CAGT, TCAG, GTCA, ACTG, GACT, TGAC, and CTGA result in the same matrix. So there are at most 24/8 = 3 different matrices of the characteristic curve representing the same DNA sequence.

On the other hand, we can see from Fig. 3 that changing the order from AGTC to AGCT or ACGT changes the distance between A and T, and that the change from AGCT to ACGT changes the distance between A and C. So the matrices based on the orders AGTC, AGCT, and ACGT are different. That is, there are at least three different matrices of the characteristic curve representing the same DNA sequence. We conclude that there are exactly three matrix representations of the characteristic curve representing one DNA sequence.

Bases of DNA can be classified into two groups in three ways: purine (A, G)/pyrimidine (C, T), amino (A, C)/keto (G, T), and weak H-bond (A, T)/strong H-bond (G, C) [11]. From Fig. 3 we can see that the three representations correspond to the three classifications.

Fig. 2. Rotation and exchange.

Fig. 3. Three arrangements of {A, G, T, C}.

We choose the leading eigenvalues of the M/M and L/L matrices as DNA descriptors. Since the characteristic curve does not represent the genuine molecular geometry, we are not interested in the interpretation of the leading eigenvalues of these
Table 5
The eigenvalues λi (i = 1, 2, ..., 10) of the M/M, kL/kL (k = 1, 2, 5, 10, 50, 100), and bL/bL matrices of the sequence ATGGTGCACC

Eigenvalue M/M        L/L        E          ⌊E⌋        bL/bL      2L/2L      5L/5L      10L/10L    50L/50L    100L/100L
λ1         10.7541    6.7634     37.2530    34.3429    1.9190     5.2855     3.1865     2.3023     1.9213     1.9190
λ2         0.2533     0.1697     -0.7544    -0.5125    1.6825     0.8879     1.7287     1.8533     1.6848     1.6825
λ3         -0.4425    -0.1918    -0.8393    -0.5540    1.3097     0.3249     1.0166     1.2525     1.3092     1.3097
λ4         -0.7745    -0.5502    -1.1462    -0.6298    0.8308     -0.2269    0.3136     0.6341     0.8295     0.8308
λ5         -0.9560    -0.7689    -1.3911    -0.7722    0.2846     -0.6043    -0.2952    -0.0184    0.2819     0.2846
λ6         -1.3375    -0.9356    -1.8248    -1.000     -0.2846    -0.8801    -0.7391    -0.5535    -0.2874    -0.2846
λ7         -1.4394    -0.9742    -2.0761    -1.5053    -0.8308    -0.9621    -0.9513    -0.9216    -0.8322    -0.8308
λ8         -1.8124    -1.0795    -2.9588    -2.4259    -1.3097    -1.1457    -1.2992    -1.3627    -1.3103    -1.3097
λ9         -2.0494    -1.1734    -6.3358    -6.5114    -1.6825    -1.2955    -1.4319    -1.4738    -1.6802    -1.6825
λ10        -2.1957    -1.2585    -19.9265   -20.4317   -1.9190    -1.3838    -1.5288    -1.7121    -1.9167    -1.9190
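In Table 5 the column for k = 100 coincides with the column for the binary matrix. One way to see why is sketched below, under two assumptions that the Letter itself does not spell out: that the kL/kL matrix is obtained by raising every entry of L/L to the k-th power, and that L/L is the ratio of straight-line distance to path length. Only adjacent vertices then keep the value 1 for large k, which leaves the adjacency matrix of a path on 10 vertices; its eigenvalues have the well-known closed form 2cos(jπ/11). All function names are illustrative.

```python
import math

BASE_TO_XY = {"A": (-1, 0), "G": (1, 0), "T": (0, -1), "C": (0, 1)}


def ll_matrix(sequence):
    """Assumed L/L matrix: straight-line distance divided by path length along the curve."""
    pts = [BASE_TO_XY[b] + (i,) for i, b in enumerate(sequence, start=1)]
    n = len(pts)
    cumulative = [0.0]
    for k in range(1, n):
        cumulative.append(cumulative[-1] + math.dist(pts[k - 1], pts[k]))
    ll = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            ll[i][j] = ll[j][i] = math.dist(pts[i], pts[j]) / (cumulative[j] - cumulative[i])
    return ll


def power_matrix(matrix, k):
    """Raise every entry to the k-th power (assumed definition of kL/kL)."""
    return [[v ** k for v in row] for row in matrix]


def path_eigenvalues(n):
    """Closed-form eigenvalues of the adjacency matrix of a path with n vertices."""
    return [2 * math.cos(j * math.pi / (n + 1)) for j in range(1, n + 1)]


L100 = power_matrix(ll_matrix("ATGGTGCACC"), 100)
print(round(L100[0][1], 4), round(L100[0][2], 4))  # adjacent vs. non-adjacent entry
print([round(v, 4) for v in path_eigenvalues(10)])  # compare with the bL/bL column
```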
Table 6
The coding sequence of the first exon of the human β-globin gene

           Eigenvalue                       Scaled eigenvalue
           M/M        L/L        D/D        M/M        L/L        D/D
λ1         93.3947    57.1974    0.27136    100        100        100
λ2         1.6513     2.0160     0.07010    1.7681     3.5248     25.833
λ3         1.5396     1.3120     0.06208    1.6485     2.2938     22.877
λ4         1.3068     1.1488     0.04361    1.3992     2.0085     16.071
λ5         1.2414     1.0595     0.03315    1.3292     1.8524     12.216
λ6         1.1696     0.7279     0.03236    1.2523     1.2726     11.925
λ7         0.9597     0.6648     0.02842    1.0276     1.1623     10.473
λ8         0.9341     0.5606     0.02563    1.0002     0.9801     9.445
λ9         0.8915     0.4739     0.02429    0.9546     0.8285     8.951
λ10        0.7879     0.4341     0.02071    0.8436     0.7590     7.632

The first 10 eigenvalues λi (i = 1, 2, ..., 10) of the M/M, L/L, and D/D matrices of the sequence: ATGGTGCACCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGAGTAAGTTGGTGGTGAGGCCCTGGGCAG.
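A leading eigenvalue such as those in Table 6 can be approximated without a full eigen-decomposition. The sketch below uses power iteration, which is a generic numerical technique and not necessarily how the authors obtained their values. It also assumes that M/M is the Euclidean distance divided by the number of edges between the vertices; the sequence is copied from the note under Table 6, and all names are illustrative.

```python
import math

BASE_TO_XY = {"A": (-1, 0), "G": (1, 0), "T": (0, -1), "C": (0, 1)}

# Sequence as printed under Table 6.
EXON = ("ATGGTGCACCTGACTCCTGA"
        "GGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGAGTAAGTTGGTGGTGAGGCCCTGGGCAG")


def mm_matrix(sequence):
    """Assumed M/M matrix: Euclidean distance divided by the number of edges |i - j|."""
    pts = [BASE_TO_XY[b] + (i,) for i, b in enumerate(sequence, start=1)]
    n = len(pts)
    mm = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            mm[i][j] = mm[j][i] = math.dist(pts[i], pts[j]) / (j - i)
    return mm


def leading_eigenvalue(matrix, iterations=200):
    """Power iteration for the largest eigenvalue of a symmetric non-negative matrix."""
    n = len(matrix)
    vector = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        product = [sum(a * v for a, v in zip(row, vector)) for row in matrix]
        norm = math.sqrt(sum(p * p for p in product))
        vector = [p / norm for p in product]
    # Rayleigh quotient v^T A v of the final unit vector
    product = [sum(a * v for a, v in zip(row, vector)) for row in matrix]
    return sum(v * p for v, p in zip(vector, product))


print(round(leading_eigenvalue(mm_matrix(EXON)), 4))
```

The fixed iteration count is a simplification; a production version would stop once successive estimates agree to a chosen tolerance.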
matrices, but are interested in them as numerical parameters that may facilitate comparisons of DNA sequences. The eigenvalues of the M/M and L/L matrices for the DNA sequence ATGGTGCACC are given in Table 5. There is some parallelism between the computed eigenvalues of the M/M and L/L matrices, as could have been expected. The reason for including the L/L matrix in addition to the M/M matrix is that it leads to the ${}^{k}L/{}^{k}L$ matrices, from which one can derive additional structurally related descriptors.
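The counting argument of this section (24 orderings of the labels, but only three distinct matrices) can be checked by brute force. The sketch below is an illustration under the assumption that an ordering lists the bases placed on the -x, +x, -y, and +y semi-axes in turn; the function names are hypothetical.

```python
import itertools
import math

AXES = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # -x, +x, -y, +y


def distance_matrix(sequence, order):
    """Rounded Euclidean distance matrix when the bases in `order` sit on -x, +x, -y, +y."""
    base_to_xy = dict(zip(order, AXES))
    pts = [base_to_xy[b] + (i,) for i, b in enumerate(sequence, start=1)]
    # Rounding makes matrices comparable despite floating-point noise.
    return tuple(tuple(round(math.dist(p, q), 6) for q in pts) for p in pts)


def count_distinct_matrices(sequence):
    """Number of different distance matrices over all 24 orderings of A, G, T, C."""
    matrices = {distance_matrix(sequence, order) for order in itertools.permutations("AGTC")}
    return len(matrices)


print(count_distinct_matrices("ATGGTGCACC"))  # 3
```

The distance between two vertices depends only on whether their bases lie on the same semi-axis, on opposite semi-axes, or on perpendicular ones, so each matrix is fixed by the choice of which bases are opposite each other. There are three ways to split four bases into two opposite pairs.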
4. Application

We construct the DNA descriptors for the coding sequence of the first exon of the human β-globin gene and compare them with the descriptors based on Nandy's graphical representation (eigenvalues of the D/D matrix). In Table 6, we list the 10 leading eigenvalues of the M/M, L/L, and D/D matrices. In order to make comparison of the eigenvalues of different matrices easier, we show at the right-hand side of Table 6 the eigenvalues scaled so that the leading eigenvalue of each of the three matrices equals 100. In Fig. 4, we plot the leading eigenvalues of the L/L matrix of Table 6 against the corresponding eigenvalues of the D/D matrix. Because the largest eigenvalues of L/L and D/D are far greater than the other eigenvalues, we excluded the largest eigenvalue from Fig. 4. The main conclusions one can draw from Fig. 4 are that: (1) the numerical characterization of the present 3D graphical representation, which is easy to construct, is similar to that of the older 3D or 2D graphical representations, and (2) a few leading eigenvalues may suffice for the characterization of DNA sequences.

Fig. 4. The plot of some eigenvalues of the L/L matrix of the sequence of Table 6 vs the corresponding eigenvalues of the D/D matrix. (a) This work. (b) Ref. [1], Fig. 2.
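The scaling used on the right-hand side of Table 6 is a simple proportional rescaling. A minimal sketch, with the hypothetical helper name `scale_eigenvalues` and input values copied from the left-hand side of Table 6:

```python
def scale_eigenvalues(eigenvalues, target=100.0):
    """Rescale a list of eigenvalues so that the largest one equals `target`."""
    leading = max(eigenvalues)
    return [target * value / leading for value in eigenvalues]


# First three M/M eigenvalues from the left-hand side of Table 6.
mm_eigenvalues = [93.3947, 1.6513, 1.5396]
print([round(v, 4) for v in scale_eigenvalues(mm_eigenvalues)])
```

After scaling, eigenvalues of matrices with very different magnitudes (such as M/M and D/D) can be read on a common scale.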
References

[1] M. Randic, M. Vracko, N. Lers, D. Plavsic, Chem. Phys. Lett. 368 (2003) 1.
[2] M. Randic, M. Vracko, A. Nandy, S.C. Basak, J. Chem. Inf. Comput. Sci. 40 (2000) 1235.
[3] A. Nandy, Curr. Sci. 66 (1994) 309.
[4] A. Nandy, P. Nandy, Curr. Sci. 68 (1995) 75.
[5] A. Nandy, Curr. Sci. 70 (1996) 661.
[6] M. Randic, X.F. Guo, S.C. Basak, J. Chem. Inf. Comput. Sci. 41 (2001) 619.
[7] R. Zhang, C.T. Zhang, J. Biomol. Str. Dyn. 11 (4) (1994) 767.
[8] M.L. Lan, M.S.T. Carpendale, 1998.
[9] M. Randic, M. Vracko, M. Novic, in: M.V. Diudea (Ed.), QSPR/QSAR Studies by Molecular Descriptors, Nova Science, Huntington, NY, 2001, p. 145.
[10] E. Hamori, J. Ruskin, J. Biol. Chem. 258 (1983) 1318.
[11] Ping-an He, Jun Wang, Internet Electron. J. Mol. Des. 1 (12) (2002) 668.