Available online at www.sciencedirect.com
Physics Procedia 33 (2012) 3 – 7
2012 International Conference on Medical Physics and Biomedical Engineering
SEQassembly: A Practical Tools Program for Coding Sequences Splicing
Hongbin Lee1, Hang Yang1, Lei Fu1, Long Qin1, Huili Li2, Feng He2, Bo Wang2, Xiaoming Wu2
1 School of Life Science and Technology, Xi'an Jiaotong University, Xi'an 710049, P.R. China
2 The Key Laboratory of Biomedical Information Engineering of Ministry of Education, Xi'an Jiaotong University, Xi'an 710049, P.R. China
[email protected],
[email protected]
Abstract
A CDS (coding sequence) is the portion of an mRNA sequence that is composed of a number of exon sequence segments. Constructing CDS sequences is important for in-depth genetic analysis such as genotyping. We present a program for the MATLAB environment that processes batches of sample sequences into coding segments under the guidance of reference exon models, and splices the coding segments from the same sample into a CDS according to the exon order given in a queue file. The program is useful for transcriptional polymorphism detection and gene function studies.
© 2012 Published by Elsevier B.V. Selection and/or peer review under responsibility of ICMPBE International Committee.
Keywords: CDS; splicing site; exon; mRNA; MATLAB.
1. Introduction
Progress in modern biotechnology and medical science has made it clear that almost all biological traits and human diseases are associated with genetic factors. Mapping functional genes and detecting mutation sites in the genome have become key tasks in biomedical research. Since Sanger established the dideoxynucleotide sequencing method in 1977, DNA
This work is partially supported by the National Natural Science Foundation of China (60601017).
1875-3892 © 2012 Published by Elsevier B.V. Selection and/or peer review under responsibility of ICMPBE International Committee. doi:10.1016/j.phpro.2012.05.022
Hongbin Lee et al. / Physics Procedia 33 (2012) 3 – 7
sequencing has become a basic tool for exploring inherited information. To date, a large number of genome sequences have been generated by DNA sequencing experiments of all kinds. Eukaryotic gene sequences are composed of protein-coding sequences (exons) and non-coding sequences (introns). Exons contain the information for protein synthesis and cover most of the variations associated with individual phenotype, yet make up only a small part of the genomic sequence. They are therefore a more suitable research target than whole genomic sequences. The traditional way to capture exon sequences includes searching biological databases, designing primers from conserved regions, amplification by polymerase chain reaction (PCR), sequencing on a DNA analyzer, motif discovery in the sequences, and so on.
To obtain a CDS, the sequenced exon segments must be assembled. Common sequence assembly software such as Phrap [1], TIGR Assembler [2] and CAP3 [3] was developed to meet the requirements of genome shotgun sequencing [4], whose strategy is to break a complete sequence into small random fragments, sequence them separately, compute the overlap relationship between each pair of fragments, and connect them into longer sequences, as shown in figure 1.
Fig.1 Example of assembling two sequence fragments, A and B, by computing their overlap relationship
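The overlap-based strategy can be illustrated with a minimal Python sketch. It is not the algorithm used by Phrap, TIGR Assembler or CAP3, which rely on approximate alignment and quality information; it assumes exact suffix-prefix matches, and the function names and the minimum overlap length are hypothetical choices made for illustration.

```python
from typing import Optional


def overlap_length(left: str, right: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`.

    Assumption: exact matching only; returns 0 when no overlap of at
    least `min_len` letters exists.
    """
    longest = min(len(left), len(right))
    for k in range(longest, min_len - 1, -1):
        if left.endswith(right[:k]):
            return k
    return 0


def merge_pair(left: str, right: str, min_len: int = 3) -> Optional[str]:
    """Join two fragments through their overlap, or return None if none is found."""
    k = overlap_length(left, right, min_len)
    if k == 0:
        return None
    return left + right[k:]


# Two shotgun-style fragments that share the letters "TGCA".
fragment_a = "ACGTTGCA"
fragment_b = "TGCAGGT"
merged = merge_pair(fragment_a, fragment_b)

# Two exon-like segments with no shared end: the merge finds nothing to join.
exon_merge = merge_pair("ATGGCC", "TTAGGA")
```

Under these assumptions the two fragments are joined through their shared end, while the exon-like pair yields no result, which is the situation that motivates a template-guided approach.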
However, this strategy does not meet the needs of splice-site detection in coding sequences, because adjacent exon sequences do not overlap. In addition, these programs can handle only one sample at a time, whereas genetic research such as case-control studies needs dozens or even thousands of samples, and exon splicing demands more precision than genome assembly. The work is usually carried out with multiple sequence alignment tools together with repetitive operations on sequence files, which is complicated, time-consuming and error-prone.
2. METHODS
The workflow of our CDS method can be outlined in five steps:
(1) Retrieve the exon sequences from a biological database. Software prediction of exon splicing sites is not yet fully mature, so exon sequences can only be obtained accurately by aligning the genomic nucleotide sequence with the cDNA. Some well-known web services, such as Splign at NCBI, compute alignments of cDNA to genomic sequences. Only nucleotide sequences in FASTA format, or sequence accession numbers, are needed for exon retrieval. See figure 2.
(2) If more than one sequence in the database corresponds to an exon, the best one is selected as the exon template. All the sample sequences and exon templates are saved in a FASTA format file.
Fig.2 Example of retrieving exon sequences with Splign at NCBI. The alignment between cDNA NM_214647 and nucleotide sequence NW_732498 shows five exon sequences.
(3) Multiple sequence alignments are performed using software or a web service, and the results are saved in ALN format. The column positions of the first and last letters of the exon sequence are the splicing site positions; they can be found by searching for the transitions between the gap letter "-" and a base letter, as shown in figure 3.
Fig.3 Determining the splicing site positions
(4) Cut all the excess letters outside the exon region according to the cut positions computed in step 3, and save the central part of each sequence in FASTA format. After trimming the MSA file for one exon, add the file name of its remaining exon part to the splicing queue file.
(5) Link the trimmed sequences in succession according to the exon order in the splicing queue file, and save the result in FASTA format. The construction of a CDS file from four exons is shown schematically in figure 4.
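Steps (3) and (4) can be sketched in Python as follows. This is an illustrative sketch and not the authors' MATLAB code: it assumes the alignment has already been parsed into a dictionary of equal-length rows, that the exon template row is identified by name, and that gap letters are removed from the trimmed samples. The names `exon_columns` and `trim_to_exon` are hypothetical.

```python
from typing import Dict, Tuple


def exon_columns(aligned_template: str, gap: str = "-") -> Tuple[int, int]:
    """Return (start, end) alignment columns of the exon, 0-based, end exclusive.

    The exon spans from the first to the last non-gap letter of the
    aligned template row, i.e. the gap-to-base transitions of step (3).
    """
    base_columns = [i for i, letter in enumerate(aligned_template) if letter != gap]
    if not base_columns:
        raise ValueError("template row contains only gap letters")
    return base_columns[0], base_columns[-1] + 1


def trim_to_exon(rows: Dict[str, str], template_id: str, gap: str = "-") -> Dict[str, str]:
    """Cut every sample row to the exon columns of the template (step 4).

    Assumption: gap letters inside the exon region are dropped so the
    result can be written as plain FASTA sequences.
    """
    start, end = exon_columns(rows[template_id], gap)
    return {
        name: row[start:end].replace(gap, "")
        for name, row in rows.items()
        if name != template_id
    }


# Toy alignment: the template covers columns 3..6 of each sample.
alignment = {
    "exon1": "---ACGT--",
    "sample1": "TTGACGTCC",
    "sample2": "TTGAC-TCC",
}
cut_positions = exon_columns(alignment["exon1"])
trimmed = trim_to_exon(alignment, "exon1")
```

The letters to the left and right of the cut positions correspond to the residual parts that are discarded from the central exon segment.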
Fig.4 Example of the CDS file construction procedure for four exons
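Step (5) can be sketched in the same way. The Python example below is a hedged illustration, not the splice.m module: it assumes each exon has already been trimmed into a dictionary mapping sample names to segments, that the list order follows the splicing queue file, and that samples are matched by name across exons. The function names and the 60-letter FASTA line width are hypothetical.

```python
from typing import Dict, List


def splice_cds(exon_segments: List[Dict[str, str]]) -> Dict[str, str]:
    """Concatenate the trimmed segments of each sample in queue order.

    Assumption: every exon must contain exactly the same samples;
    otherwise no CDS is produced and an error is raised.
    """
    if not exon_segments:
        return {}
    samples = list(exon_segments[0])
    for index, segments in enumerate(exon_segments[1:], start=2):
        if set(segments) != set(samples):
            raise ValueError(f"exon {index} has a different set of samples")
    return {s: "".join(segments[s] for segments in exon_segments) for s in samples}


def to_fasta(records: Dict[str, str], width: int = 60) -> str:
    """Format records as FASTA text with lines of at most `width` letters."""
    lines = []
    for name, sequence in records.items():
        lines.append(">" + name)
        lines.extend(sequence[i:i + width] for i in range(0, len(sequence), width))
    return "\n".join(lines) + "\n"


# Three toy exons, listed in the order given by the queue.
queue = [
    {"sample1": "ATG", "sample2": "ATG"},
    {"sample1": "CCA", "sample2": "CGA"},
    {"sample1": "TAA", "sample2": "TAA"},
]
cds = splice_cds(queue)
fasta_text = to_fasta(cds)
```

Checking the sample set before concatenation keeps a missing sample in one exon from silently producing a shortened CDS.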
3. RESULTS
SEQassembly is a set of MATLAB programs that selectively cut sequences into segments under the guidance of reference exon models. SEQassembly has five necessary components: SEQassembly.m (running module), assembly.m (dispatching module), seqcutsimple.m (cutting module), splice.m (linking module), and Queuefilename.fas (queue file). The requirements and steps for running the program are:
1) The length of every exon should be no more than 900 bp.
2) The number of sample sequences should be consistent across every ALN and MSA file; otherwise the output file "output.fas" is not generated and error information is added to the file "errorinfo.fas".
3) Add the ALN file names one by one to the file "Queuefilename.fas" in the corresponding exon order.
4) Run the MATLAB file SEQassembly.m.
5) During the run, head files and tail files, which hold the residual parts left after the sequences have been cut, are produced; their names carry the prefixes "head" and "tail" respectively. The final CDS sequences are recorded in the FASTA file "output.fas".
We also developed a Windows application named "SEQassembly.exe". It requires the MATLAB runtime environment (MCR: MCRInstaller.exe) to be installed beforehand. The source code of this study can be obtained by e-mailing the authors.
Acknowledgment
This study was supported by the National Natural Science Foundation of China (60601017), Scientific Research Foundation of Shaanxi Provincial Office of Health, P.R. China, and Fundamental Research Funds for Xi'an Jiaotong University.
References
[1] M. de la Bastide and W. R. McCombie, "Assembling genomic DNA sequences with PHRAP," Current Protocols in Bioinformatics, Mar. 2007, Unit 11.4, doi:10.1002/0471250953.bi1104s17.
[2] G. G. Sutton, O. White and M. D. Adams, "TIGR Assembler: A New Tool for Assembling Large Shotgun Sequencing Projects," Genome Science and Technology, vol. 1, Apr. 1995, pp. 9-19, doi:10.1089/gst.1995.1.9.
[3] X. Huang and A. Madan, "CAP3: A DNA Sequence Assembly Program," Genome Research, vol. 9, 1999, pp. 868-877, doi:10.1101/gr.9.9.868.
[4] S. Anderson, "Shotgun DNA sequencing using cloned DNase I-generated fragments," Nucleic Acids Research, vol. 9, May 1981, pp. 3015-3027, doi:10.1093/nar/9.13.3015.
[5] T. G. Burland, "DNASTAR's Lasergene sequence analysis software," Methods in Molecular Biology, vol. 132, 1999, pp. 71-91, doi:10.1385/1-59259-192-2:71.
[6] S. Kim and A. M. Segre, "AMASS: A Structured Pattern Matching Approach to Shotgun Sequence Assembly," Journal of Computational Biology, vol. 6, 1999, pp. 163-186, doi:10.1089/cmb.1999.6.163.
[7] T. Chen and S. S. Skiena, "A case study in genome-level fragment assembly," Bioinformatics, vol. 16, Jun. 2000, pp. 494-500.
[8] P. A. Pevzner, H. Tang and M. S. Waterman, "An Eulerian path approach to DNA fragment assembly," Proceedings of the National Academy of Sciences of the United States of America, vol. 98, Aug. 2001, pp. 9748-9753, doi:10.1073/pnas.171285098.