Molecular and Cellular Probes 28 (2014) 228e236
Contents lists available at ScienceDirect
Molecular and Cellular Probes journal homepage: www.elsevier.com/locate/ymcpr
Discrete Ramanujan transform for distinguishing the protein coding regions from other regions Wei Hua a, *, Jiasong Wang b, Jian Zhao c a
Department of Basic Courses, Nanjing Institute of Technology, Nanjing 211167, China Department of Mathematics, Nanjing University, Nanjing 210093, China c School of Science, Nanjing University of Technology, Nanjing 210009, China b
a r t i c l e i n f o
a b s t r a c t
Article history: Received 22 August 2013 Accepted 17 April 2014 Available online 29 April 2014
Based on the study of Ramanujan sum and Ramanujan coefficient, this paper suggests the concepts of discrete Ramanujan transform and spectrum. Using Voss numerical representation, one maps a symbolic DNA strand as a numerical DNA sequence, and deduces the discrete Ramanujan spectrum of the numerical DNA sequence. It is well known that of discrete Fourier power spectrum of protein coding sequence has an important feature of 3-base periodicity, which is widely used for DNA sequence analysis by the technique of discrete Fourier transform. It is performed by testing the signal-to-noise ratio at frequency N/3 as a criterion for the analysis, where N is the length of the sequence. The results presented in this paper show that the property of 3-base periodicity can be only identified as a prominent spike of the discrete Ramanujan spectrum at period 3 for the protein coding regions. The signal-to-noise ratio for discrete Ramanujan spectrum is defined for numerical measurement. Therefore, the discrete Ramanujan spectrum and the signal-to-noise ratio of a DNA sequence can be used for distinguishing the protein coding regions from the noncoding regions. All the exon and intron sequences in whole chromosomes 1, 2, 3 and 4 of Caenorhabditis elegans have been tested and the histograms and tables from the computational results illustrate the reliability of our method. In addition, we have analyzed theoretically and gotten the conclusion that the algorithm for calculating discrete Ramanujan spectrum owns the lower computational complexity and higher computational accuracy. The computational experiments show that the technique by using discrete Ramanujan spectrum for classifying different DNA sequences is a fast and effective method. 2014 Elsevier Ltd. All rights reserved.
Keywords: Ramanujan sum 3-Base periodicity Protein coding region DNA sequence Discrete Ramanujan spectrum Signal-to-noise ratio
1. Introduction In the last years, a large number of computational techniques have been developed for the characterization and detection of protein coding regions (exon DNA sequences) in the DNA sequence. Recently, more and more efforts have been placed on Fourier analysis in gene-finding algorithms (Tiwari et al., 1997 [1]; Anastassiou, 2000 [2]) which are based on the well-recognized fact that an exon DNA sequence with length N has a 3-base periodicity property, which is identified as a pronounced peak at the frequency N/3 of the Fourier power spectrum, but the 3-base periodicity does not exist in most of noncoding regions, intron DNA sequences and other unknown DNA sequences (Fickett et al., 1992 [3]). The 3-base periodicity property of an exon DNA sequence is the fundamental
* Corresponding author. Tel.: þ86 13451850568. E-mail address:
[email protected] (W. Hua). http://dx.doi.org/10.1016/j.mcp.2014.04.002 0890-8508/ 2014 Elsevier Ltd. All rights reserved.
criterion for the researches of gene finding (Fickett et al., 1992 [3]; Fickett, 1996 [4]; Zhang, 2002 [5]; Mathé et al., 2002 [6]; Yin C. et al., 2007 [7], etc.). The technique of time-frequency domains analysis, Ramanujan Fourier transform (RFT), has recently been re-discovered for signal processing applications (Planat M. et al., 2002 [8]; Sugavaneswaran L. et al., 2012 [9]), being initially derived as building blocks of arithmetical functions in number theory (Ramanujan S. 1918 [10]). Mainardi L. T. et al. have applied RFT to analyze secondary structure content in amino acid sequences (Mainardi L. T. et al., 2007 [11]) and analyzed T-Wave alternans for help inspecting heart disease. (Mainardi L. T. et al., 2008 [12]). We have modified method of RFT and presented a novel method, Discrete Ramanujan Transform (DRT) and its power spectrum to distinguish protein coding regions from noncoding regions of a DNA sequence. The main idea is that finding a quantity likely the 3-base periodicity property of the Fourier power spectrum to describe the difference among exon, intron and other unknown sequences. The quantity is the value of a peak in the figure
W. Hua et al. / Molecular and Cellular Probes 28 (2014) 228e236
of the discrete Ramanujan spectrum (DRS). In Section 2.4, the advantages and merits of DRT method comparing with discrete Fourier Transform (DFT) are also shown in this paper. In Section 3, we tested all the exon and intron sequences in whole chromosomes 1, 2, 3 and 4 of Caenorhabditis elegans. The histogram figures show the different SNR(3) value distributions between the tested exons and introns. Also by the statistical way, a threshold from the computational results is selected to separate exon sequences and intron ones for illustrating the reliability of the method.
229
And m(n) is the Möbius function (Mitrinovic D. S. et al., 1995 [13])
mðnÞ ¼
8 <
0; 1; : ð1Þk ;
if n contains a square bk > 1; if n ¼ 1; if n is the product of k distinct primes:
t Y b and n ¼ ni i (the unique prime number decomposition of n, n1 < n2
2. Concepts, method and algorithm remarks N 1 X gðnÞ; N n¼1
2.1. Discrete Ramanujan transform
Av ðgÞ ¼ limN/N
For a numerical sequence of length N, we denote it to be x(1), x(2), .., x(N). In RFT, each element x(n) (n ¼ 1,2,.,N) in the sequence can be expressed as
the mean value of an arithmetic function g(n), and X(k) can be shown as
xðnÞ ¼
N X
XðkÞ ¼
XðkÞck ðnÞ;
(1)
k¼1
where
k X
ck ðnÞ ¼
l ¼ 1;ðl;kÞ ¼ 1
l exp i2p n k
(2)
is the Ramanujan sum and where X(k), k ¼ 1,2,..,N, are known as Ramanujan coefficients or RFT. In (2), the symbolism (l,k) denotes the greatest common divisor of l and k (thus the term (l,k) ¼ 1 means that l and k are co-primes). The Ramanujan sum is the basis function over which x(n) is projected. It can be shown that Ramanujan sum satisfies many suitable properties such as the multiplicative and the orthogonal properties (Planat M. et al., 2002 [8]). Recall that the DFT of the numerical sequence fxðnÞgN n¼1 is defined as
b x ðpÞ ¼
2ipnp ; xðnÞexp N n¼1 N X
and if ck(n) in (1) is defined as
k ck ðnÞ ¼ exp i2p n ; N
(3)
then, the formula of discrete Fourier inverse transform is obtained. There is a close analogy between (2) and (3). The basis function ck(n), n ¼ 1,2,..,N, in (3) is obtained as multiples of a basis frequency (1/N) defined by the signal length. All basis function, ck(n)s, in (2) are built by summing up components which are multiples of the same periodicity (k) and only components (l/k), in which l and k are co-primes, contribute to the sum. For practical implementation, Ramanujan sums can be evaluated from the basic functions of number theory in a relatively easy way. The Ramanujan sums may be evaluated from the following relationship
ck ðnÞ ¼ m
k ðk; nÞ
fðkÞ
f
;
k ðk;nÞ
(4)
where (k,n) denotes the greatest common divisor (GCD) of k and n, ɸ(k) is the Euler totient function
1 fðkÞ ¼ k P 1 ; ki i t Y and k ¼ kai i (the unique prime number decomposition of k, ¼ 1 < kt are prime numbers and ai are positive integers.) k1 < k2 < i .
1
fðkÞ
Av ðxðnÞck ðnÞÞ;
k ¼ 1; 2; ..
(5)
We try to use the X(k) as a criterion quantity for distinguishing exon sequence from other sequences, and only by comparing these quantities of different sequences to complete the distinguishing process. The length of the test DNA sequences is finite, while in the formula of Av(g), N tends to be infinite. For practical usage of (5), we do not consider the limit process in (5) and modify the X(k) as
XðkÞ ¼
N 1 1 X xðnÞck ðnÞ; fðkÞ N n ¼ 1
k ¼ 1; 2; ..; N;
(6)
where N is the length of the DNA sequence and name the X(k) in (6) as discrete Ramanujan transform (DRT). Formula (5) is the formula for RFT and it is derived from the view of theoretical mathematics. We use (6) for a practical usage and the DRT abbreviation to show the difference between formulas (5) and (6). We use numerical representations for DNA sequences and transform them into numerical valued ones. With the usage of DFT, we get the result composed of a series of frequencies and corresponding amplitudes and phases. After computing the power spectrum of DFT for the series, we can get amplitudes and corresponding frequencies for the tested DNA sequences. We measure a certain peak in the power spectrum of DFT by its amplitude and its signalenoise-ratio. For Ramanujan’s sum, it is an integer, while the coefficients of DFT are complex numbers. For the measurement of DRT method shown in Section 3, we may show a similar criterion like the power spectrum of DFT for DRT. The power spectrum of DFT is defined as: * 2 PSðnÞ ¼ b x ðnÞb x ðnÞ ¼ jb x ðnÞj ;
n ¼ 1; 2; .; N
where N is the total number of points in the sequence. So we define the square of the coefficient as the spectrum of DRT. Similarly with DFT, the spectrum of DRT, DRS, is defined as
PðkÞ ¼ jXðkÞj2 ;
k ¼ 1; 2; ..; N:
(7)
2.2. Voss representation of a DNA sequence DNA molecules are composed of four linearly linked nucleotides, adenine (A), thymine (T), cytosine (C), and guanine (G). A DNA symbolic sequence can be constructed by the permutation of four characters A, T, C and G. In order to use technology of the digital signal processing to deal with the symbolic sequence, much work have been done to change the symbolic sequence into numerical
230
W. Hua et al. / Molecular and Cellular Probes 28 (2014) 228e236
sequence (Voss, 1992 [15]; Tiwari et al., 1997 [1]; Yau et al., 2003 [16]). The Voss representation for DNA sequence is following. Firstly, we assign the following four 4-dimension unit base vectors to the corresponding four nucleotides in the DNA symbolic sequence
0 1 1 B0C B C/A @0A 0
0 1 0 B1C B C/T @0A 0
0 1 0 B0C B C/C @1A 0
0 1 0 B0C B C/G : @0A 1
Then, translate the DNA sequence to a numerical sequence in 4D space as
0 1 0 1 0 1 0 1 1 0 0 0 B0C B1C B0C B0C B C B C B C B ; n xðnÞ ¼ uA ðnÞ@ A þuT ðnÞ@ A þuC ðnÞ@ A þuG ðnÞ@ C 0 0 1 0A 0 0 0 1 ¼ 1;2;..;N; (8) where ua(n) is the indicator sequence
ua ðnÞ ¼
1; 0;
when a appears at location n; otherwise;
in which a˛{A,T,C,G} and n ¼ 1,2,..,N. 2.3. The power spectrum of DRT for a DNA sequence The DRT of Voss numerical representation for a DNA sequence is defined as
0 1 0 1 0 1 0 1 1 0 0 0 B0C B1C B0C B0C C B C B C B C XðkÞ ¼ UA ðkÞB @ 0 A þUT ðkÞ@ 0 A þUC ðkÞ@ 1 A þUG ðkÞ@ 0 A; k 0 0 0 1 ¼ 1;2;..;N; (9) where the Ua(k), a˛{A,T,C,G}, are the DRT for the indicator sequences. Therefore, the DRS of the DNA sequence is
P ðkÞ ¼ PA ðkÞ þ PT ðkÞ þ PC ðkÞ þ PG ðkÞ;
k ¼ 1; 2; ..; N; (10)
in which Pa(k), a˛{A,T,C,G}, are the DRSs of the indicator sequences of the DNA sequence. There is a noticeable phenomenon that in the DRS of exon sequence, one value of the DRSs, P(3), has much larger value than other DRSs while intron sequences do not have such property. Large amount of tests are performed to support our assertion in Section 3. The demonstration of the following two examples is to clarify how our newly introduced DRT method is performed on given DNA sequences. Also, for comparison, we calculated the power spectra of DFT and DRT of the test sequences, respectively. For the power spectrum of DFT at frequency k, we use the following formula defined in Ref. [1],
PSðkÞ ¼ PSA ðkÞ þ PST ðkÞ þ PSC ðkÞ þ PSG ðkÞ;
k
¼ 0; 1; 2; ..; N 1; where PSA(k), PST(k), PSC(k) and PSG(k) are the Fourier power spectrum of the four indicator sequences uA(n), uT(n), uC(n) and uG(n), respectively. The results of the computational experiments
are described by figures. Due to the fact that for real number signals the DFT spectrum shows its symmetry property, all the following DFT spectrum figures in this paper show only half of the original figures. The exon sequence we selected is from Homo sapiens chromosome 1, spanning from base 1,637,408 to 1,637,513 on NC_000001.11, Gene ID: 984 from NCBI. It is of 106 bp length. The computational results are in Figs. 1 and 2 respectively. From Fig. 1, we can see that there is a pronounced peak in the DFT power spectrum at frequency N/3. Also, from Fig. 2, there is a prominent peak at k ¼ 3 in the DRT power spectrum. The intron sequence we selected is from Homo sapiens chromosome 1, spanning from base 1,043,733 to 1,043,822 on NC_000001.11, Gene ID:375790. It is of 90 bp length. The figures of the Fourier spectrum and DRS for the intron sequence placed in Figs. 3 and 4 respectively. From Fig. 3, we can see that there is no pronounced peak in the DFT power spectrum at frequency N/3. Also, from Fig. 4, there is no prominent peak at k ¼ 3 in the DRT power spectrum. During the last years, the foundation of gene-finding algorithms involving DFT method is that an exon DNA sequence with length N can be identified as a pronounced peak at the frequency N/3 of the Fourier power spectrum, often called the 3-base periodicity property (Tiwari et al., 1997 [1]; Anastassiou, 2000 [2], Fickett et al., 1992 [3]; Fickett, 1996 [4]; Zhang, 2002 [5]; Mathé et al., 2002 [6]; Yin C. et al., 2007 [7], etc.). In Yin and Yau’s work (2005) [17], there is a conclusion that the 3-base property of a DNA sequence is determined by the unbalanced nucleotide distributions of the three codon positions, i.e., when a nucleotide has an unbalanced distribution in the three codon positions, there will be a 3-base periodicity in the corresponding indicator sequence. From the comparison of the above figures, we may think that the pronounced peak at k ¼ 3 in the DRT power spectrum of a DNA sequence is also determined by the unbalanced nucleotide distributions of the three codon positions. To validate this assumption, we tested many exons and introns. Also we generated many artificial DNA sequences. All the results imply that the unbalanced nucleotide distributions on the three codon positions contribute to the pronounced peak at k ¼ 3 in the DRT power spectrum in the indicator sequences. Suppose an artificial DNA sequence with the following probabilities for nucleotides on the three codon positions,
0
PA1 B PT1 B @ PC1 PG1
PA2 PT2 PC2 PG2
0 1 0:1 PA3 B PT3 C C ¼ B 0:2 @ 0:3 PC3 A 0:4 PG3
0:3 0:2 0:1 0:4
1 0:6 0:2 C C 0:1 A 0:1
where Pxi denote the probability of nucleotide X on the ith positions. We can get Fig. 5 after using DRT method. The DRT power spectrum of this artificial DNA sequence in Fig. 5 shows that there are clear peaks at k ¼ 3 for indicator sequences uA(n), uC(n), and uG(n), all of which have unbalanced nucleotide distributions (Fig. 5(a), (c), (d)). But the pronounced peak at k ¼ 3 does not exist in the indicator sequence uT(n) (Fig. 5(b)). The reason for the absence of the peak at k ¼ 3 in this indicator sequence is that the distribution of nucleotide T on the three codon positions is uniform, which does not contribute to the 3-base periodicity. Since the DRT power spectrum of the DNA sequence is the sum of the power spectra of four indicator sequences, there is a pronounced peak at k ¼ 3. A conclusion can be obtained: when a nucleotide has an unbalanced distribution in the three codon positions, there will be a pronounced peak at k ¼ 3 of the DRS in the corresponding indicator sequence.
W. Hua et al. / Molecular and Cellular Probes 28 (2014) 228e236
231
Fig. 1. DFT power spectrum of the exon DNA sequence, which is the sum of the power spectra of the four indicator sequences.
Fig. 2. a) DRT power spectrum of the exon DNA sequence, which is the sum of the power spectra of the four indicator sequences. b) DRT power spectrum of the exon DNA sequence over k ¼ 2 to k ¼ 18, which is to spotlight the DRT power spectrum at k ¼ 3.
Fig. 3. DFT power spectrum of the intron DNA sequence, which is the sum of the power spectra of the four indicator sequences.
232
W. Hua et al. / Molecular and Cellular Probes 28 (2014) 228e236
Fig. 4. a) DRT power spectrum of the intron DNA sequence, which is the sum of the power spectra of the four indicator sequences. b) DRT power spectrum of the intron DNA sequence over k ¼ 2 to k ¼ 18, which is to spotlight the DRT power spectrum at k ¼ 3.
According to these phenomena, the algorithm to distinguish the protein coding regions from the noncoding regions of a DNA sequence only needs to calculate the DRSs of those sequences and compare the DRSs at k ¼ 3 or one may check the figure of DRS of exon sequence at k ¼ 3 with a higher peak than the figures of noncoding sequences to distinguish exon sequence from other noncoding sequences. In order to increase the reliability of the criterion value of P(3), we will introduce the definition of signal-to-noise ratio. The background noise of a DNA sequence of length N, can be represented mainly as average power spectrum of DRT over k ¼ 2 to k ¼ N. So the average background noise (ABN) can be defined as follows,
c2(m) ¼ c2(n). From the property of integer period of Ramanujan sums, we can evaluate X(k) in a quite simple way by just using arithmetic calculations rather than time-consuming exponential floating point computations. For instance, for an incoming numerical represented DNA sequence of length N, X(3) can be evaluated as follows:
PN ABN ¼
k ¼ 2 PðkÞ : ðN 1Þ
(11)
The signal-to-noise ratio (SNR) of a DNA sequence of length N via DRT, can be denoted as
SNRðkÞ ¼
PðkÞ ABN
k ¼ 2; 3; ..; N:
(12)
And so
Pð3Þ : SNRð3Þ ¼ ABN 2.4. Algorithm’s remarks From the formula (4), we can get the first values of Ramanujan sums,
c1 ¼ 1; c2 ¼ 1; 1; c3 ¼ 1; 1; 2; c4 ¼ 0; 2; 0; 2; c5 ¼ 1; 1; 1; 1; 4; c6 ¼ 1; 1; 2; 1; 1; 2; .. where the bar indicates the period. For instance, c2(1) ¼ 1, c2(2) ¼ 1, c2(3) ¼ 1, c2(4) ¼ 1, . ., i.e., if m h n, mod2, then
where x(k) is the numerical representation of the DNA sequence at position k for certain nucleotide A,T,C or G.; Step 4. Increase k by 1 and repeat Step 3 until k¼N; 1 $ða$c3 ð1Þ þ b$c3 ð2Þ þ c$c3 ð3ÞÞ X ð3Þ ¼ fð3Þ$N Step 5. 1 $ða b þ 2cÞ ¼ fð3Þ$N Step 6. finish. Now we emphasize the advantages and merits of DRT method from three aspects. Firstly, modern computers use finitely many digits to represent real numbers, which may cause round-off error. Exponential numbers stored in computer have rounding errors, that is to say, not exactly the input value. Floating-point representation is now used to represent real numbers approximately. All the floating point input values, for DFT, the basis functions ei2pnk/N are floating
W. Hua et al. / Molecular and Cellular Probes 28 (2014) 228e236
233
Fig. 5. DRT power spectra analysis of an artificial DNA sequence with the probabilities of nucleotides of the three codon position as described in text. (a) DRT power spectrum of indicator sequence uA(n). (b) DRT power spectrum of indicator sequence uT(n). (c) DRT power spectrum of indicator sequence uC(n). (d) DRT power spectrum of indicator sequence uG(n). (e) DRT power spectrum of the artificial DNA sequence, which is the sum of the power spectra of the four indicator sequences.
234
W. Hua et al. / Molecular and Cellular Probes 28 (2014) 228e236
point numbers and stored in approximation. Thus, after DFT calculation, the accumulated errors may affect the results, which may quite different from the true values. For instance, in the following example, we use decimal with 5 digit precision to illustrate our statements above. In a theoretical way, 1.28965 2.36417 ¼ 3.0489518405, but the round-off result from the computer operation is 3.04895, the low part of the product (18405) is lost. While for DRT, all the coefficients Ck(n),f(k), N are all integers. When we use Voss representation for the input DNA sequence, the calculations for X(k) are all integer arithmetic computation operations and only the last division may cause floating point result. It is well known that integer numbers are stored in the form of fixed-point representation in modern computer, i.e., the integer operation in computer is error free, so, the operation of computational DRT has higher accuracy. Secondly, according to the modern computer structure, all the floating point computations are more time-consuming comparing with pure integer methods. Computers may take much more cycles of clock time on floating-point computation than fixed-point computation. Even in digital signal processor (DSP) especially invented for numerical computation, the computation time for floating point is much longer than that for integer. All the computations in DFT are floating point calculations, while all the computations in DRT just involve integer arithmetic. Much computing-time can be saved. Besides, when we discuss the computation complexity, usually just multiplication time and division time is considered. Let us go into the DRT algorithm in a much deeper way. Let us compare the add time between DRT and DFT for each incoming DNA sequence with Voss numerical representation. For DFT, when we compute b x ðkÞ at certain frequency k, we exactly need to compute b x A ðkÞ,b x T ðkÞ,b x C ðkÞ and b x G ðkÞ, the total times of add are 4(N 1). For
DRT, when we compute X(k) at certain k, for Voss numerical representation, x(k) ¼ 0 or 1, and Step 3 can be more simpler,
That is to say, if x(k) ¼ 0, we can skip Step 3 and go directly to Step 4. For a certain k, there is only one 1 and three 0s at position k of the four binary indicator sequences. Since in the above algorithm, most computations concentrate on Step 3, about three fourth computing times are saved. From the above discussion, we can see that the computation of Ramanujan sum only needs the arithmetic operations instead of the exponential calculations in formula (2), which may save much computational expensiveness, i.e., the algorithm of DRT has very lower computational complexity, in general. For computing the PS(N/3), the power spectrum of DFT at the special frequency, recently, in Yin and Yau’s work (2005) [17], a formula which can avoid the direct DFT computation has been proposed to reduce much computational time and it is widely used. Their valuable formula on the power spectrum at frequency N/3 of a DNA sequence is
Fig. 6. Indicates the Frequency histogram for SNR(3) of exons in C. elegans. a) for all the exons in chromosome 1. b) for all the exons in chromosome 2. c) for all the exons in chromosome 3. d) for all the exons in chromosome 4.
W. Hua et al. / Molecular and Cellular Probes 28 (2014) 228e236
X
2 2 2 Fx1 þ Fx1 þ Fx1 ðFx1 ,Fx2 þ Fx1 ,Fx3 þ Fx2 ,Fx3 Þ ;
x ¼ A;T;C;G
where Fxi denotes the occurrence frequency of a nucleotide x(x˛(A,T,C,G)) in the ith (i ¼ 1,2,3) codon positions of a DNA sequence. Comparing the computation efficiency with the DRS of k ¼ 3, from the steps of computing P(3) stated in the algorithm remarks section, we can see that the two methods have the same order of magnitude in computation complexity.
235
Table 1 The relationship between threshold and frequencies with relative frequencies for all the exons in chromosome 1, 2, 3 and 4 in C. elegans. No. of chromosome
Chromosome Chromosome Chromosome Chromosome
1 2 3 4
SNR(3) boundaries (threshold)
SNR(3) frequency that SNR(3) is greater than the threshold
SNR(3) relative frequency that SNR(3) is greater than the threshold
6.3
15,008 15,500 13,467 14,802
0.7209 0.7074 0.7219 0.7052
3. Experiments and results for the general performance evaluation of the DRT algorithm In this section, the potential of the new approach is documented by more than 100 thousands of examples in the analysis of protein coding regions and noncoding regions of DNA sequences. We present the validity of DRT method for all the exons and introns in the whole chromosomes 1, 2, 3 and 4 of C. elegans. After the tests, we used statistics analysis on the signal-noise-ratio of the method, which are showed in histograms and tables and may be regarded as an intrinsic property of DRT on distinguishing exons and introns in DNA sequences. So a method involving DRT can be viewed as an alternative algorithm to separate exons and introns in an efficient and accurate way. The data resource of the exon an intron sequences used for the statistics study in the following are from http://bpg.utoledo.edu/ wafedorov/lab/eid.html, the ExoneIntron Database (EID) which contains annotated eukaryotic exon and intron sequences. We downloaded the data on C. elegans as flat file (ce2003.EID.tar.gz)
and selected chromosome 1, 2, 3 and 4 for our test. Altogether 82,373 exons and 69,717 introns are tested. We use the frequency histogram in probability and statistics theory to describe the phenomenon of our results. The horizontal axis of the histogram is the SNR(3) values of the tested DNA sequences while the vertical axis is the frequency of the tested DNA sequences corresponding to their SNR(3) values. Figs. 6 and 7 are the frequency histograms from the results of the tested exons and introns respectively. From the comparison between Figs. 6 and 7, we can see that the SNR(3) of exons are much higher than that of introns for each chromosome. Thus, for a given DNA sequence, putative exons and introns can be identified from their SNR(3) values. After studying the Figures, we set the threshold for SNR(3) and established Tables 1 and 2 for distinguishing exons and introns. From Tables 1 and 2, we can see that if we set SNR(3) boundaries (threshold) to be 6.3, then more than 70% of exons can be tested to
Fig. 7. Indicates the Frequency histogram for SNR(3) of introns in C. elegans. a) for all the introns in chromosome 1. b) for all the introns in chromosome 2. c) for all the introns in chromosome 3. d) for all the introns in chromosome 4.
236
W. Hua et al. / Molecular and Cellular Probes 28 (2014) 228e236
Table 2 The relationship between threshold and frequencies with relative frequencies for all the introns in Chromosome 1, 2, 3 and 4 in C. elegans. No. of chromosome
Chromosome Chromosome Chromosome Chromosome
1 2 3 4
SNR(3) boundaries (threshold)
SNR(3) frequency that SNR(3) is smaller than threshold
SNR(3) relative frequency that SNR(3) is smaller than threshold
6.3
11,656 11,940 10,389 11,666
0.655 0.6514 0.6561 0.6569
be true and more than 65% of introns can also be selected out, which means our new method may separate exon from intron in a precise way. 4. Conclusion and perspective After the investigation of Ramanujan sum and coefficients in this paper, we presented a novel technique for the nucleotide level analysis of DNA sequences, DRT. Similar with DFT, the spectrum defined by DRT, DRS, can be used for distinguishing the protein coding regions from the noncoding regions, because the value of SNR(3) can be used to describe 3-base periodicity property of exon sequences. The different values of SNR(3) for all the exon and intron sequences in whole chromosomes 1, 2, 3 and 4 of C. elegans, respectively, illustrate the reliability of the method. The algorithm analysis of DRT found that only arithmetic and integral operations are needed during the whole calculating process, i.e., the algorithm has the much lower complexity and higher computational accuracy than other signal processing methods involving exponential floating point calculations for the investigation of the properties of DNA sequences. The computational experiments show the technique based on the DRT to classify the DNA sequence is one of the fast and effective methods. And the technique may apply to the gene finding of the genome sequence, protein comparison and so on. Recently, due to the fastness and effectiveness of DRT, we have started to apply such method to the gene prediction of DNA sequence in our research work.
Acknowledgments We thank the reviewers for their helpful suggestions of improving the presentation of this paper. This paper is supported by the Science Foundation of Nanjing Institute of Technology under grants CKJB201219. Appendix A. Supplementary data Supplementary data related to this article can be found at http:// dx.doi.org/10.1016/j.mcp.2014.04.002. References [1] Tiwari S, Ramachandran S, Bhattacharya A, Bhattachrya S, Ramaswamy R. Prediction of probable genes by Fourier analysis of genomic sequences. CABIOS 1997;113:263e70. [2] Anastassiou D. Frequency-domain analysis of biomolecular sequences. Bioinformatics 2000;16:1073e81. [3] Fickett JW, Tung C-S. Assessment of protein coding measure. Nucleic Acids Res 1992;20:6441e50. [4] Fickett JW. The gene identification problem: an overview for developers. Comput Chem 1996;20:103e18. [5] Zhang MQ. Computational prediction of eukaryotic proteincoding genes. Nat Genet 2002;3:698e710. [6] Mathé C, Sagot M-F, Schiex T, Rouzé P. Current methods of gene prediction, their strength and weaknesses. Nucleic Acids Res 2002;30:4103e17. [7] Yin C, Yau SS-T. Prediction of protein coding regions by the 3-base periodicity analysis of a DNA sequence. J Theor Biol 2007;247:687e94. [8] Planat M, Rosu H, Perrine S. Ramanujan sums for signal processing of lowfrequency noise. Phys Rev E 2002;66(0561):281e7. [9] Sugavaneswaran L, Shengkun X, Umapathy K, Krishnan S. Time-frequency analysis via Ramanujan sums, signal processing letters. IEEE June 2012;19(6):352e5. [10] Ramanujan S. On certain trigonometrical sums and their application in the theory of numbers. Trans Camb Phil Soc 1918;22:259e76. [11] Mainardi LT, Pattini L, Cerutti S. Application of the Ramanujan Fourier transform for the analysis of secondary structure content in amino acid sequences. Methods Inf Med 2007;46(2):126e9. [12] Mainardi LT, Bertinelli M, Sassi R. Analysis of T-wave alternans using the Ramanujan transform. Comput Cardiol 2008;35:605e8. [13] Mitrinovic DS, Sándor J. Handbook of number theory. Dordrecht, Netherlands: Kluwer; 1995. [14] Carmichael R. Expansions of arithmetical functions in infinite series. Proc Lond Math Soc 1932;34:2. [15] Voss R. Evolution of long-range fractal correlation and 1/f noise in DNA base sequences. Phys Rev Lett 1992;68:3805e8. [16] Yau SS-T, Wang J, Niknejad A, Lu C, Jin N, Ho Y. DNA sequence representation without degeneracy. Nucleic Acids Res 2003;31:3078e80. [17] Yin C, Yau SS-T. A Fourier characteristic of coding sequences: origins and a non-Fourier approximation. J Comput Biol 2005;12:1153e65.