Applied Mathematics Letters 24 (2011) 232–237
Contents lists available at ScienceDirect
Applied Mathematics Letters journal homepage: www.elsevier.com/locate/aml
Similarity analysis of DNA sequences based on the EMD method✩ Fenglan Bai a , Jihong Zhang a,∗ , Junsheng Zheng b a
School of Science, Dalian Jiaotong University, Dalian 116028, PR China
b
Department of Computer Science and Technology, Neusoft Institute of Information, Dalian 116023, PR China
article
info
Article history: Received 25 March 2010 Received in revised form 10 September 2010 Accepted 15 September 2010 Keywords: EMD IMFs DNA sequences Nonlinear signal sequence
abstract DNA sequences can be translated into 2D graphs and into numerical sequences; we call the numerical sequences nonlinear signal sequences. We can use the empirical mode decomposition (EMD) method to divide nonlinear signal sequences into a group of well-behaved intrinsic mode functions (IMFs) and a residue, so that we can compare the similarities among DNA sequences conveniently and intuitively. This work tests the method’s suitability by using the mitochondria of four different species. © 2010 Elsevier Ltd. All rights reserved.
1. Introduction Rapid development in biotechnologies has led to biological sequence data showing a sharp increase. The research into biological sequences, which can be seen as significant arcana of life processes, is a crucial and basic part of scientific study. There are many problems to study concerning biological sequences, for example, the DNA sequence content of genetic and regulated information, the relationship between structure and function of the protein sequences, protein secondarystructure prediction, the comparison of the similarities of biological sequences and so on. One central problem is that of comparisons among different kinds of biological sequences. The goal is to find and define the stability region and a law of variations, and then use them to find functional features and distinctions. Every time we get a new biological sequence, we always hope to prove that it is similar to a known sequence. Then we can predict the function of the unknown sequence through analyzing the relationship between structure and function in the sequences. Obviously, if the new sequence has homology with a known sequence, it will save much time and energy in recalibrating the function of the new sequence. This seems particularly important because of the bulkiness of the biological sequence. The template-matching technique is a typical and mature method, but the calculated quantities of this method are very extensive and the scoring function has some subjectivity. At the same time, some biological sequences are very long—for example, minimal simple cells (Escherichia coli) have about 5 × 106 base pairs—so it is difficult to analyze these sequences directly by comparison. So, people now work at finding new methods for carrying out the comparison and analysis of biological sequences, particularly whole genome sequences. For example, we can use graphs to stand for biological sequences; there is a method that converts biological sequences to graphs. This method makes it possible to compare biological sequences that have different lengths. Also it has been successfully applied to the genome shotgun sequences of many species [1–12]. The differences of DNA sequences of mitochondria are influenced only by mutations. The mitochondrial DNA sequence has being mutating at the speed of 2.2% per million years. It is called a conserved sequence [13]. As a result, there ✩ This work was supported by the Educational Commission of Liaoning Province of China (Grant No. 2009A125).
∗
Corresponding author. E-mail addresses:
[email protected] (F. Bai),
[email protected] (J. Zhang),
[email protected] (J. Zheng).
0893-9659/$ – see front matter © 2010 Elsevier Ltd. All rights reserved. doi:10.1016/j.aml.2010.09.010
F. Bai et al. / Applied Mathematics Letters 24 (2011) 232–237
233
The corresponding nonlinear signal of common chimpanzee 2
1.5
1
s
0.5
0
-0.5
-1
-1.5
-2 0
2000
4000
6000
8000
10000
12000
14000
16000
18000
t
Fig. 1. The 2D graph that corresponds to the mitochondrial DNA sequence of the common chimpanzee, D38116.
is application value in using similarity of DNA sequences to study relationships among different species. We select mitochondrial DNA sequences of four species—the common chimpanzee (D38116), pigmy chimpanzee (D3811313), fin whale (X61145), and blue whale (X72204)—as our research objects. Then we can make use of the EMD method to carry out research on the similarity. In another words, we convert DNA sequences to nonlinear signal sequences on the basis of the 2D graphs first; after that we will use the EMD method to get the IMFs of each DNA sequence and compare the corresponding residues. This will make DNA sequences more visual and simple. 2. The EMD method The EMD method is a generally nonlinear, non-stationary data processing method proposed by Norden Huang et al. [14] in 1998. With this decomposition, a complicated data set can be decomposed into a finite and often small number of IMFs that admit well-behaved Hilbert transforms and a residue. The decomposition can also be viewed as an expansion of the data in terms of the IMFs. In EMD, the original signal can be reconstructed by adding all IMFs and the residue without information loss. The EMD method is an adaptive and nonlinear signal decomposition approach. The decomposition is based on the following assumptions: (1) the signal has at least two extrema, namely one maximum and one minimum; (2) the characteristic time scale is defined by the time lapse between the extrema; and (3) if the data were totally devoid of extrema but contained only inflection points, then we can differentiate once or more times to reveal the extrema. Final results can be obtained by integration(s) of the components. EMD can be used to extract these intrinsic modes from the original signal, on the basis of the local characteristic scale of the data set itself, and to represent each intrinsic mode as an IMF, which meets the following two conditions: (1) in the whole data set, the number of extremes and the number of zero crossings must either be equal or differ by at most 1; (2) at any point, the mean value of the envelope defined by the local maxima and the envelope defined by the local minima is close to zero. The two conditions ensure that an IMF is a nearly periodic function and the mean is close to zero (less than a predetermined threshold). In practice, the IMFs are extracted through a sifting process. The EMD [14–16] algorithm is described as follows: Identify all the local maxima and minima of the original signal s(t ). Generate its upper and lower envelopes, eu (t ) and el (t ), with cubic spline interpolation. Calculate the point-by-point mean m(t ) from the upper and lower envelopes: m(t ) = (eu (t ) + el (t ))/2. Extract the mean m(t ) from the original signal and define the difference of s(t ) and m(t ) as h1 (t ), that is h1 (t ) = s(t ) − m(t ). (5) If h1 (t ) is an IMF, it is designated as c1 (t ) = h1 (t ), and if h1 (t ) is not an IMF, treat h1 (t ) as the original signal and repeat the steps (1)–(4) k times: h11 = h1 − m11 , h1k = h1(k−1) − m1k , until h1k (t ) is an IMF; it is then designated as c1 (t ) = h1k (t ). (6) After getting the first component, remove the first component from the original signal and obtain the residue r1 (t ) as follows: r1 (t ) = s(t ) − c1 (t ). (7) Treat r1 (t ) as the original datum and repeat the above processes. The second component c2 (t ) can be obtained. Repeat the process as described above n times. The nth residue can be acquired as rn (t ) = rn−1 (t ) − cn (t ). The decomposition process can stop when rn (t ) becomes a monotonic function or a constant from which no more IMF component can be extracted. (1) (2) (3) (4)
234
F. Bai et al. / Applied Mathematics Letters 24 (2011) 232–237 EMD of pigmy chimpanzee
r11 C11 C10
C9
C8
C7
C6
C5
C4
C3
C2
C1
s(t)
2 0 -2 2 0 -2 2 0 -2 2 0 -2 1 0 -1 1 0 -1 1 0 -1 1 0 -1 1 0 -1 1 0 -1 0.5 0 -0.5 0.5 0 -0.5 0.4 0.2 0 0
2000
4000
6000
8000
10000
12000
14000
16000
18000
16000
18000
t
Fig. 2. The EMD of the pigmy chimpanzee. EMD of common chimpanzee
r11 C11 C10
C9
C8
C7
C6
C5
C4
C3
C2
C1
s(t)
2 0 -2 5 0 -5 2 0 -2 2 0 -2 1 0 -1 1 0 -1 1 0 -1 1 0 -1 1 0 -1 0.5 0 -0.5 1 0 -1 0.5 0 -0.5 0.4 0.2 0 0
2000
4000
6000
8000
10000
12000
14000
t
Fig. 3. The EMD of the common chimpanzee.
Thus, we achieve a decomposition∑ of the data into n IMFs, c1 (t ), c2 (t ), . . . , cn (t ), and a residue rn (t ), which can be either n the mean trend or a constant, s(t ) = i=1 ci (t ) + rn (t ). 3. Nonlinear signal sequence representation for DNA sequences In order to analyze the DNA sequence, we convert it into one-dimensional or multi-dimensional discrete complex sequences; this method is called signalization of a DNA sequence. It can also be called converting DNA sequences to nonlinear signal sequences. According to [17,18], we construct a map called the one-dimensional map: A → t = +1, T → t = +2, G → t = −1, C → t = −2. Certainly, any other map and arrangement is allowed. We can suppose that S = s1 s2 · · · may be an arbitrary DNA sequence. Then we can obtain a recursive formula as follows: S (ti+1 ) =
S (ti ) + S (tsi+1 ) d
.
(1)
si stands for a unit, S (ti ) stands for a one-dimensional signal sequence, d is a non-zero real number, and S (t0 ) = 0. According to the definition of the DNA sequence and the previous recursive formula, we can get the one-dimensional signal sequence that corresponds to a DNA sequence, no matter how long the DNA sequence is. For example, we can get
F. Bai et al. / Applied Mathematics Letters 24 (2011) 232–237
235
r11
C11 C10 C9
C8
C7
C6
C5
C4
C3
C2
C1
s(t)
EMD of fin whale 2 0 -2 5 0 -5 2 0 -2 2 0 -2 2 0 -2 2 0 -2 5 0 -5 5 0 -5 5 0 -5 5 0 -5 5 0 -5 2 0 -2 2 0 -2 0
2000
4000
6000
8000
10000
12000
14000
16000
18000
t
Fig. 4. The EMD of the fin whale.
r11
C11 C10
C9
C8
C7
C6
C5
C4
C3
C2
C1
s(t)
EMD of blue whale 2 0 -2 5 0 -5 2 0 -2 2 0 -2 2 0 -2 1 0 -1 5 0 -5 5 0 -5 5 0 -5 5 0 -5 5 0 -5 2 0 -2 2 0 -2 0
2000
4000
6000
8000
10000
12000
14000
16000
18000
t
Fig. 5. The EMD of the blue whale.
the signal sequence which corresponds to the mitochondrial DNA sequence of the common chimpanzee, D38116; here d is equal to 2. Also we get the 2D graph (see Fig. 1). 4. Analysis of similarity The DNA sequence is in one to one correspondence with its signal sequence, so if we want to analyze and compare the features and similarity of DNA sequences, the only thing that we need to do is compare the features and investigate the similarity of their signal sequences. We select four species for analyzing features and comparing similarity. First, according to Eq. (1), we can convert their mitochondrial DNA sequences to signal sequences. Secondly, we can use the appropriate EMD method based on cubic spline interpolation for the signals to get their corresponding IMFs and residues. Finally, we compare the residues. The closer their residues are, the more similar the two species are. This is shown in Figs. 2–7. From Figs. 2–5, we see that the sequences are very long, so it is difficult to judge the similarity for these four species and their corresponding signals s(t ). But after dissociating the signals s(t ) on the basis of EMD dissociation, the residues are monotonic. It is easy to judge the similarity from the residue graph. From Figs. 6 and 7 we can see that the common chimpanzee and pigmy chimpanzee are close. Also the fin whale and blue whale are close. As a matter of fact, the common chimpanzee and pigmy chimpanzee are primates; their genes are very similar. The fin whale and blue whale are Balaenopteridae, so their genes are also very similar. But there is some disparity between the chimpanzee and pigmy chimpanzee, and
236
F. Bai et al. / Applied Mathematics Letters 24 (2011) 232–237 The corresponding residue 0.4 common chimpanzee 0.2
r
0
0
2000
4000
6000
8000
10000
12000
14000
16000
18000
0.4 r
pigmy chimpanzee
0.2 0
0
2000
4000
6000
8000
10000
12000
14000
16000
18000
16000
18000
2 fin whale 0
r
-2
0
2000
4000
6000
8000
10000
12000
14000
2 r
blue whale
0 -2 0
2000
4000
6000
8000
10000
12000
14000
16000
18000
16000
18000
t
Fig. 6. The residue comparison in different subplots. The corresponding residue
2
blue whale
1.5
1
fin whale
0.5 common chimpanzee pigmy chimpanzee r
0
-0.5
-1
-1.5 0
2000
4000
6000
8000
10000
12000
14000
t
Fig. 7. The residue comparison in a single plot.
between the fin whale and blue whale. We can see from the above residue graphs that the results are in accordance with the real world. 5. Conclusion This work has studied how to use a one-dimensional map to convert DNA sequences to 2D graphs and to nonlinear signal sequences. Furthermore, we use the EMD method to divide the nonlinear signal sequence into a set of well-behaved IMFs and a residue. Also we can conveniently and directly establish the similarity of different DNA sequences by comparing the corresponding residues. In the end, we get a good method which is suited to investigating long DNA sequences. The suitability and effectiveness of this method are tested. Compared with BSDP which uses direct sign matching, it avoids the subjectivity of the gap penalty. Compared with methods that use matrix dimension reduction and compression, it avoids the loss of biological information. But this method is affected by endpoints, so it may not be used to compare short sequences. We hope that this method can be used in some areas such as comparative analysis of DNA sequences, prediction of protein structures, and comparison of RNA secondary-structure predictions. Comparison of sequences is still a developing area. A good measure will play an important role in enhancing the efficiency and accuracy for inquiring into, searching for and sorting repeated subsequences in the sequence, and discussing evolutionary relationships and evolutionary processes of different species.
F. Bai et al. / Applied Mathematics Letters 24 (2011) 232–237
237
References [1] M. Randic, M. Vracko, On the similarity of DNA primary sequence, Journal of Chemical Information and Computer Sciences 40 (2) (2000) 599–606. [2] M. Randic, On characterization of DNA primary sequences by a condensed matrix, Chemical Physics Letters 317 (11) (2000) 29–34. [3] M. Randic, M. Vracko, N. Lers, D. Plavsic, Novel 2-D graphical representation of DNA sequence and their numerical characterization, Chemical Physics Letters 368 (9) (2003) 1–6. [4] M. Randic, M. Vracko, N. Lers, D. Plavsic, Analysis of similarity/dissimilarity of DNA sequences based on novel 2-D graphical representation, Chemical Physics Letters 431 (5) (2006) 375–379. [5] M. Randic, J. Zupan, D. Vikic-Topic, D. Plavsic, A novel unexpected use of a graphical representation of DNA: graphical alignment of DNA sequence, Chemical Physics Letters 371 (13) (2003) 202–207. [6] Bo Liao, Tianming Wang, Analysis of similarity of DNA sequences based on triplets, Journal of Chemical Information and Computer Sciences 44 (6) (2004) 1666–1670. [7] Fenglan Bai, Tianming Wang, A 2-D graphical representation of protein sequences based on nucleotide triplet codons, Chemical Physics Letters 413 (6) (2005) 458–462. [8] Feng-lan Bai, Ying-zhao Liu, Tianming Wang, A representation of DNA primary sequences by random walk, Mathematical Biosciences 209 (1) (2007) 282–291. [9] V.A. Simossis, J. Kleinjung, J. Heringa, Homology-extended sequence alignment, Nucleic Acids Research 33 (3) (2005) 816–824. [10] M. Zhang, W.W. Fang, H.J. Zhang, X.Z. Chi, MSAID: multiple sequence alignment based on a measure of information discrepancy, Computational Biology and Chemistry 29 (2) (2005) 175–181. [11] H.O. Hasan, S. Khalid, A new sequence distance measure for phylogenetic tree construction, Bioinformatics 19 (16) (2003) 2122–2130. [12] A. Nandy, P. Nandy, On the uniqueness of quantitative DNA difference descriptors in 2D graphical representation models, Chemical Physics Letters 368 (2003) 102–107. [13] Fenglan Bai, Tianming Wang, The construction of phylogenetic tree by graphic representation of DNA sequences, WSEAS Transactions on Information Science and Applications 2 (2005) 463–467. [14] N.E. Huang, Z Shen, S.R. Long, et al., The empirical mode decomposition and the Hilbert spectrum for nonlinear and non-stationary time series analysis, Proceedings of the Royal Society of London Series A (S1471-2946) 454 (1998) 903–995. [15] N.E. Huang, M.L. Wu, W.D. Qu, Applications of Hilbert–Huang transform to non-stationary financial time series analysis, Applied Stochastic Models in Business and Industry (S1524-1904) 19 (3) (2003) 245–268. [16] X. Zhang, K.K. Lai, S.Y. Wang, A new approach for crude oil price analysis based on empirical mode decomposition, Energy Economics 30 (2008) 905–918. [17] J. Zupan, M. Randic, Algorithm for coding DNA sequences into ‘‘spectrum-like’’ and ‘‘zigzag’’ representations, Journal of Chemical Information Modelling 45 (2005) 309–313. [18] Fenglan Bai, Dachao Li, Tianming Wang, A new mapping rule for RNA secondary structures with its applications, Journal of Mathematical Chemistry 43 (2008) 932–943.