Inferring Helitron Structures from 1D and 2D Representations Based on the Chaos Game Theory

Inferring Helitron Structures from 1D and 2D Representations Based on the Chaos Game Theory

JID:IRBM AID:434 /FLA [m5+; v1.252; Prn:1/03/2017; 14:43] P.1 (1-9) Disponible en ligne sur ScienceDirect www.sciencedirect.com IRBM ••• (••••) •••...

4MB Sizes 0 Downloads 49 Views

JID:IRBM AID:434 /FLA

[m5+; v1.252; Prn:1/03/2017; 14:43] P.1 (1-9)

Disponible en ligne sur

ScienceDirect www.sciencedirect.com IRBM ••• (••••) •••–•••

TAIMA

Inferring Helitron Structures from 1D and 2D Representations Based on the Chaos Game Theory I. Messaoudi ∗ , A. Elloumi Oueslati, Z. Lachiri Université de Tunis El Manar, Ecole Nationale d’Ingénieurs de Tunis, LR Signal, Images et Technologies de l’Information, BP 37, le Belvédère, 1002, Tunis, Tunisia Received 18 October 2015; received in revised form 10 September 2016; accepted 11 January 2017

Abstract This work introduces a new procedure to highlight helitron structures in the C. elegans DNA sequences. In this sense, the DNA strings are converted to numerical representations either in one or two dimensions. As a one dimensional coding, we choose the Frequency Chaos Game Signal (FCGS) which encodes the DNA based on the Chaos Game theory. Afterwards, a time–frequency representation – generated by the complex Morlet wavelet analysis – transforms the obtained signal into a 2D representation. Through the examples furnished in this paper, we demonstrate the accuracy of the method in emphasizing the helitron borders especially with high orders of FCGS. A comparison with the 2D spectrogram representation confirms the obtained results. © 2017 AGBM. Published by Elsevier Masson SAS. All rights reserved. Keywords: Frequency Chaos Game Signal; Complex Morlet wavelet transform; Short-time Fourier transform; Helitrons; C. elegans

1. Introduction The chromosomal DNA sequences are known by their heterogeneity in terms of length, composition, structure and function; which makes their study a difficult task. Among the complex DNA structures that attracted the scientists in last decades we turn our attention to helitrons. The Helitron is a specific class of Transposable-Elements (TEs) which is present in large portion of eukaryotic genomes [1]. In general, TEs can catch gene fragments and transpose them within the genome through the use of a specific mechanism (from where name “jumping genes”). This operation can lead to changes in nucleic acid sequences like insertion and deletion mutations; which contribute in turn in the genomes evolution [2]. In the case of helitron, the transposition mechanism is known by rolling-circle. The introduction of this structure was done by Vladimir Kapitonov and Jerzy Jurka in 2001 [3]. Because of its high polymorphism * Corresponding author.

E-mail addresses: [email protected] (I. Messaoudi), [email protected] (A. Elloumi Oueslati), [email protected] (Z. Lachiri). http://dx.doi.org/10.1016/j.irbm.2017.01.002 1959-0318/© 2017 AGBM. Published by Elsevier Masson SAS. All rights reserved.

(heterogeneity in size and sequence composition) as well as the lack of typical features in its structure, the automated identification of helitron remains particularly recalcitrant. Nevertheless, a large number of helitrons contain TC dinucleotide downstream the 5’-end and palindromic sequences upstream the 3’end which helped in someway to identify these elements. In this context, some analysis tools have been proposed. For example, Recon [4] and Spectral Repeat Finder [5] are based on the search of the repetitive sequences [2]. In other hand, HelitronFinder both uses the 3’-end and the palindrome sequences to detect helitrons [6]. HelSearch is another program which manually searches for the end structures in helitrons like hairpins and loops. After that, it compares these structures with known consensus sequences using MUSCLE (an alignment algorithm) [2,7]. The main drawback of these methods is the necessity of a prior knowledge about the helitron repetitive sequences and termini. In this respect, results obtained by these techniques are not accurate since full lists of consensus sequences do not exist. The techniques introducing an alignment module, require also an important runtime and most of them are obstructed by the sequences length. In line with this, we furnish two ways to

JID:IRBM AID:434 /FLA

2

[m5+; v1.252; Prn:1/03/2017; 14:43] P.2 (1-9)

I. Messaoudi et al. / IRBM ••• (••••) •••–•••

highlight helitron sequences. The first method consists in coding the DNA strings into one dimensional signal on the base of the Chaos Game theory. We give the name of the “Frequency Chaos Game Signal” (FCGS) to this coding. As time series, the FCGS allows following the frequency evolution of nucleotides’ occurrence along the genome [8]. With high orders of the FCGS representation we can easily detect the helitron presence without the need of any prior knowledge about the enhanced region. The same thing goes to the second method which is a two dimensional representation of DNA using wavelets. The latter technique consists in applying the complex Morlet wavelet to the FCGS signal. As a result, the time–frequency repartition is shown to provide striking signatures of helitrons. In both the time and the time–frequency representations, the boundaries of helitrons are highlighted when we increase the FCGS level. In the light of this, we divide the paper into four sections. Section 2 gives an overview on our coding technique: the Frequency Chaos Game Signal (FCGS) as well as the complex Morlet wavelet analysis. Section 3 exposes the results. Finally, Section 4 concludes the work. 2. Coding the DNA sequences into one dimensional and two dimensional signals The DNA sequences are long chainlike molecules composed of four bases ‘A’ (Adenine), ‘T’ (Thymine), ‘C’ (Cytosine) and ‘G’ (Guanine). To ensure a direct interpretation of DNA, these characters could be transformed into a series of numerical values. In this section, we provide two ways to represent the DNA sequences into explicit signals and images. 2.1. Coding DNA by the Frequency Chaos Game Signal The Frequency Chaos Game Signal (FCGS) is a new DNA coding which replaces the DNA characters by their frequency of occurrence in the genome. The fundamental concept is inspired from the Chaos Game theory. The procedure consists in calculating the matrix of words occurrence frequency for the whole chromosome using the Frequency Chaos Game Representation (FCGR). The calculus depends on a defined word length (k). Effectively for each k-lengthen word we extract the appropriate frequency of appearance; then we assign this value to the considered position [8]. As illustrative example, we consider the sequence “ACCTAGCTGGA” which we encode by three levels of FCGS: FCGS1 , FCGS2 and FCGS3 (see Fig. 1). In general, the choice of the FCGS level (k) is arbitrary; but for ease of calculation we achieve the eleventh level (FCGS11 ). 2.2. Mapping DNA sequences into images based on the Complex Morlet wavelet analysis Because the DNA contains a broad scope of regular structures with varying size and composition, one must choose a tool that operates like a microscope and serves therefore to capture all the important trends in these sequences. In this framework, the wavelets offer a powerful way to view the details of time varying and transient phenomena in genomic signals. The

wavelets offer a multi-resolution decomposition of a signal by decomposing it into its elementary constituents across scale [9]. According to the following equation, the continuous wavelet transform (CWT) of a signal X(t) is obtained by convolution with a set of compressed and dilated versions of a specific function called mother wavelet ψ(t): 1 Tψ (X)(a, b) = √ a

+∞ t −b X(t)ψ  ( ) dt a

(1)

−∞

There are a variety of types of wavelet functions. The Complex Morlet wavelet is proven to be the most appropriate wavelet to study transients and non-stationary signals (such is the case of the genomic signals). It is a Gaussian-windowed complex sinusoid defined by: 1

1

1 2

ψ(t) = π − 4 (eiω0 t − e− 2 ω0 )e− 2 t 2

(2)

Here ω0 corresponds to the number of oscillations of the wavelet [10]. The absolute value of the obtained coefficients is then plotted in the time-scale plan. It is also possible to use the frequency for representing the wavelet coefficients instead of the scale since the frequency set is proportional to scale one [10,11]. This representation is called scalogram and communicates the time frequency localization property of signals. In this work, we use the complex Morlet wavelet analysis to map DNA as a 2D image (scalogram). 3. Results and discussions It is well-known that helitrons are complex DNA entities; their identification remains difficult despite the continued efforts to this end. The approaches used up to now have not taken into account the signal processing tools to detect the presence of helitrons. In fact, most of the proposed works within the framework of genomic signal processing were concentrated about segmenting genes into coding (exon) and non-coding regions based on the 3-bp periodicity in exons. For this, different techniques have been used: the smoothing [12], the filtering [13], the Auto-Regressive technique [13,14], the Choi–Williams Distribution [15,16], the Fourier Transform [16,17], the Wavelet Transform [18,19] and the Pitch Synchronous analysis [20]. In this work, we supply new perspectives in terms of DNA visualization and highlighting its hotspots. The first method, the Frequency Chaos Game Signal, is a 1D coding which represents a natural way for segmenting DNA without the need for any preprocessing or analysis technique. In the following we will demonstrate the effectiveness of this method in revealing the helitron entity in the C. elegans genome. For this, we consider an example of Helitron Y1A (which positioned on the chromosome I at [1317449–1319319]). The data is retrieved from the NCBI database http://www.ncbi.nlm.nih.gov/. Since the FCGS method offers a multitude of representative signals (FCGSs) for a same input sequence, we choose to code the chromosome I by the first eleven levels of FCGS. The FCGSs representations of the considered helitron are given in Fig. 2. For this example, amplitudes are multiplied by 100 for clarity purpose. Further,

JID:IRBM AID:434 /FLA

[m5+; v1.252; Prn:1/03/2017; 14:43] P.3 (1-9)

I. Messaoudi et al. / IRBM ••• (••••) •••–•••

3

Fig. 1. Illustration of the FCGS process to represent the sequence ‘ACCTAGCTGGA’.

the red lines in the paper’s figures delimit the helitronic area. From the latter sub-figures, we can easily see that increasing the FCGS order implies a general smoothing of the signal. Indeed, the helitron area becomes enhanced from level 2 (FCGS2 ). For the last level (FCGS11 ), the signal amplitude dramatically falls in the borders of the helitron region; which makes it the best coding to represent the helitron structure. The second method consists on applying the complex Morlet wavelet analysis (along 64 scales) to the FCGS codings. The results are 2D representations of the DNA sequence in the time–frequency plan (scalogram images). In Fig. 3, we provide the Helitron Y1A scalograms when we encode with FCGS1 to FCGS11 .

The helitron Y1A scalograms exhibit common features (roughly the same form of the repetitive motifs). The principal difference between them is the enhancement of specific periodicities from one level to another. In this sense, the time– frequency behavior of the helitron constitutes a signature. The obtained signatures demonstrate that increasing the FCGS order serves to well locate the helitron especially when we reach the eleventh level. To prove even more the effectiveness of the second method of DNA representation, we compare with the classical Fourier spectrogram [21–23]. For this aim, we apply a sliding window (of variable size and overlap) to the FCGS11 signal. To provide a frequency content localization, we apply the Discrete Fourier

JID:IRBM AID:434 /FLA

4

[m5+; v1.252; Prn:1/03/2017; 14:43] P.4 (1-9)

I. Messaoudi et al. / IRBM ••• (••••) •••–•••

Fig. 2. Helitron Y1A representation by FCGS1 to FCGS11 . (For interpretation of the references to color in this figure, the reader is referred to the web version of this article.)

JID:IRBM AID:434 /FLA

[m5+; v1.252; Prn:1/03/2017; 14:43] P.5 (1-9)

I. Messaoudi et al. / IRBM ••• (••••) •••–•••

Fig. 3. Helitron Y1A representation by scalogram when coded by FCGS1 to FCGS11 .

5

JID:IRBM AID:434 /FLA

6

[m5+; v1.252; Prn:1/03/2017; 14:43] P.6 (1-9)

I. Messaoudi et al. / IRBM ••• (••••) •••–•••

Fig. 4. Helitron Y1A representation by spectrogram when coded by FCGS1 to FCGS11 .

JID:IRBM AID:434 /FLA

[m5+; v1.252; Prn:1/03/2017; 14:43] P.7 (1-9)

I. Messaoudi et al. / IRBM ••• (••••) •••–•••

7

Fig. 5. Scalogram and spectrogram representations of different Helitrons and the correspondent FCGS11 signals.

Fig. 6. Visualization of the Helitron Y1A DNA alteration: the black dashed lines mark the sites of substitution; (a): representation of the original and the altered sequences by FCGS11 respectively in blue and red; (b): zoom in on the two signals. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

JID:IRBM AID:434 /FLA

8

[m5+; v1.252; Prn:1/03/2017; 14:43] P.8 (1-9)

I. Messaoudi et al. / IRBM ••• (••••) •••–•••

Transform (DFT) on the resulting sequences. This analyze is known as the Short-Time Fourier Transform (STFT). For graphical rendering, the magnitude of the absolute value of the STFT coefficients is considered [24–26]. In our experimentations, we tested many types of windows with different sizes. The best visualization results are obtained for a Chebyshev window with 128 bp of length. Next, Fig. 4 shows the spectrograms for Helitron Y1A when coded with FCGS1 to FCGS11 . From the latter spectrograms, we can also note the effect of raising the FCGS order in the enhancement of the DNA frequencies. We can also note that our method outperforms the Fourier analysis in terms of repeats recognition. In fact, the scalogram representation permits a better localization of information in both time and frequency plans. Apart from this example, we succeeded in identifying other helitron structures in C. elegans due to their unique time– frequency signature. The tests were performed on the C. elegans genome (5 autosoms and 1 gonosom) and thus for the 10 classes of helitron (Helitron 1, Helitron Y1, Helitron Y1A, Helitron 2, Helitron Y2, Helitron Y3, Helitron Y4, NDNAX1, NDNAX2 and NDNAX3). Despite the variability of helitron in terms of size (which varies from 11 bp to 8965 bp) and structure, the helitron DNA is easily tracked due to its characterization by a specific behavior in the time and the time–frequency plans. As examples, we supply in the following the scalogram maps of four different helitrons as well as the correspondent spectrograms and the FCGS11 signals (see Fig. 5). The considered helitrons are taken from chromosome I: Helitron 1 [280520–281446], Helitron Y1 [559931–560711], Helitron Y1A [669652–671747] and Helitron Y4 [11486081–11486213]. It is important to mention that the helitron borders are accentuated using the FCGS11 coding in all cases; which facilitates its localization. In the other hand, the FCGS11 mapping offers itself a temporal signature for the helitron DNA. Despite that, passing to a time–frequency representation does not guarantee a time–frequency signature. Indeed, given its characterization by a specific signature, we can easily locate the helitron in the scalogram map independently of its class or size. However, with the spectrogram representation we cannot distinguish between two helitrons of different classes like in the case of Helitron Y1 and Helitron Y1A. As we have seen, the scalogram visualizes the internal arrangement of bases in helitron DNA and reveals the characteristic signature of the latter. Now, we would check the stability of the technique in the case of alteration. For this aim, we consider again the sequence of the Helitron Y1A (chromosome I: [1317449–1319319]). Since we are dealing with an FCGS of order 11, we chose a repetitive subsequence with size 11 bp to be subject of change. Our choice goes to the sequence “TTTTTTTGAAA” which is found 40 times in the helitron sequence. Obviously, if we just change a letter in the chosen segment, the correspondent frequency of apparition would be different so that the FCGS11 signal would look different. Consequently, any punctual change in the DNA sequence would be reflected in the scalogram representation. In this work, we substi-

Fig. 7. Visualization of the Helitron Y1A DNA alteration using the scalogram; (a): scalogram of the original sequence; (b): scalogram of the altered sequence; (c): scalogram of the difference between the correspondent signals.

tute the letter G by C (TTTTTTTGAAA → TTTTTTTCAAA) and we observe the layout of the signal (Fig. 6). The two signals have substantially the same appearance and contain different repeating units. With a zoom in on the signals, we can see clear transitions which correspond to the substitution sites. The typical repetitive motifs which relate to the alteration areas are encircled in black. We now turn to the scalogram representation. Fig. 7 provides the scalogram of the original helitronic sequence (Fig. 7(a)) as well as the scalogram of the modified sequence (Fig. 7(b)). Careful examination of these subfigures shows a slight differ-

JID:IRBM AID:434 /FLA

[m5+; v1.252; Prn:1/03/2017; 14:43] P.9 (1-9)

I. Messaoudi et al. / IRBM ••• (••••) •••–•••

ence between the two representations. In order to enhance the substitution sites, we consider scalogram representation of the difference between the two signals (Fig. 7(c)). The result illustrates zones with high energy which indicate the presence of alteration (we count 40 alterations). Overall, the methods we described in this work are efficient in terms of helitron DNA enhancement. Besides the fact that their implementation is simple, the two techniques offer the advantage of being independent from any pre-processing knowledge. 4. Conclusion Helitrons are highly heterogeneous DNA structures. This makes it difficult to detect them within the genome. Unlike traditional methods, we proposed in this paper two ways to enhance the presence of helitron entity without the need of any previous knowledge (such as repetitive motifs, hairpins and loops) or preprocessing operation. The first technique we introduced; aims at enhancing the Helitron structure in C. elegans simply based on a DNA coding. This coding, coined Frequency Chaos Game Signal (FCGS), uses the frequency of words apparition in the genome. As for the second method, it consists on applying the complex Morlet wavelet on the FCGS sequence. For both techniques, Helitrons of different classes are easily highlighted when we code with high orders of the FCGS representation. The 2D scalogram coding has shown in addition its usefulness in revealing helitrons compared to the 2D spectrogram representation as well as its aptitude in detecting alteration in helitron DNA. As perspective of this work, one can automate the search of helitrons in the genome based on the Frequency Chaos Game Signal with high order or just by recognizing their specific signatures. References [1] Munoz-Lopez M, Garcia-Perez J. DNA transposons: nature and applications in genomics. Curr Genomics 2010;11(2):115–28. [2] Yang L, Bennetzen J. Structure-based discovery and description of plant and animal helitrons. Proc Natl Acad Sci USA 2009;106(31):12832–7. [3] Kapitonov V, Jurka J. Rolling-circle transposons in eukaryotes. Proc Natl Acad Sci USA 2001;98:8714–9. [4] Bao Z, Eddy S. Automated de novo identification of repeat sequence families in sequenced genomes. Genome Res 2002;12:1269–76. [5] Sharma D, Issac B, Raghava G, Ramaswamy R. Spectral repeat finder (SRF): identification of repetitive sequences using Fourier transformation. Bioinformatics 2004;20:1405–12. [6] Wenwe X, Limei H, Jinsheng L, Hugo K, Chunguang D. HelitronScanner uncovers a large overlooked cache of helitron transposons in many plant genomes. Proc Natl Acad Sci USA 2014;111(28):10263–8.

9

[7] Edgar R. Muscle: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 2004;32:1792–7. [8] Messaoudi I, Elloumi A, Lachiri Z. Building specific signals from frequency chaos game and revealing periodicities using a smoothed Fourier analysis. IEEE/ACM Trans Comput Biol Bioinform 2014;11(4):1–15. [9] Grossmann A, Morlet J. Decomposition of Hardy functions into square integrable wavelets of constant shape. SIAM J Math Anal 1984;15:723–36. [10] Najmi A, Sadowsky J. The continuous wavelet transform and variable resolution time–frequency analysis. J Hopkins APL Tech Dig 1997;18(1):134–40. Eindhoven University of Technology, Control Systems Technology Group Eindhoven. [11] Steinbuch M, de Molengraft MV. Wavelet theory and applications, a literature study, R.J.E. Eindhoven University of Technology, Control Systems Technology Group, Eindhoven, 2005. [12] Elloumi A, Lachiri Z, Ellouze N. Exon_intron separation using amino acids groups frequency repartition as coding technique. Int J Adv Comput Sci Appl 2014;5(4):27–32. [13] Vaidyanathan P, Yoon B. Gene and exon prediction using allpass-based filters. In: IEEE workshop on genomic signal processing and statistics. Raleigh, NC, 2002. [14] Akhtar M, Epps J, Ambikairajah E. Signal processing in sequence analysis: advances in eukaryotic gene prediction. IEEE J Sel Top Signal Process 2008;2(3):310–21. [15] Melia U, Vallverdu M, Claria F, Gallardo J, Perera A, Caminal P. Choi– Williams distribution to describe coding and non-coding regions in primary transcript pre-MRNA. J Med Biol Eng 2013;33(33):504–12. [16] Melia U, Claria F, Gallardo J, Caminal P, Perera A, Vallverdu M. Exons and introns characterization in nucleic acid sequences by time–frequency analysis. In: 32nd annual international conference of the IEEE EMBS; 2010. [17] Tsonis A, Elsner J, Tsonis P. Periodicity in DNA coding sequences: implications in gene evolution. J Theor Biol 1991;151(3):323–31. [18] Mena-Chalco P, Carrer H, Zana Y, Cesar R. Identification of protein coding regions using the modified Gabor-wavelet transform. IEEE/ACM Trans Comput Biol Bioinform 2008;5(2). [19] Wang L, Stein L. Localizing triplet periodicity in DNA and cDNA sequences. BMC Bioinform 2010;11(1):550. [20] Elloumi A, Lachiri Z, Ellouze N. Detecting particular features in C. elegans genomes using synchronous analysis based on wavelet transform. Int J Bioinform Res Appl 2011;7(2):183–201. [21] Anastassiou D. Frequency–domain analysis of biomolecular sequences. Bioinformatics 2000;12(16). [22] Fukushima A, Ikemura T, Oshima T, Mori H, Kanaya S. Detection of periodicity in eukaryotic genomes on the basis of power spectrum analysis. Genomics Inform 2011;13:21–9. [23] Sussillo D, Kundaje A, Anastassiou D. Spectrogram analysis of genomes. EURASIP J Appl Signal Process 2004;1:29–42. [24] Du L, Zhou H, Yan H. OMWSA: detection of DNA repeats using moving window spectral analysis. Bioinformatics 2007;23(5):631–3. [25] Kubicova V, Provaznik I. Use of whole genome DNA spectrograms in bacterial classification. Comput Biol Med 2016;69:298–307. [26] Bucur A, van Leeuwen J, Dimitrova N, Mittal C. Frequency sorting method for spectral analysis of DNA sequences. In: IEEE international conference on bioinformatics and biomedecine (BIBM); 2008. p. 43–50.