ExonSuite: Algorithmically optimizing alternative gene splicing for the PUF proteins

Computers in Biology and Medicine 43 (2013) 1023–1024



Contents lists available at SciVerse ScienceDirect

Computers in Biology and Medicine journal homepage: www.elsevier.com/locate/cbm

Dilan Ustek a,*, Abraham Kohrman b, Bogdan Krstic c, Karissa Fernandez b

a Grinnell College, Computer Science Department, USA
b Grinnell College, Department of Biology, USA
c Grinnell College, Department of Mathematics and Statistics, USA

Article info

Article history: Received 16 January 2013; Accepted 20 May 2013

Keywords: Software; Exon; PUF proteins; Gene splicing; Data mining

Abstract

The regulation of mRNA stability and translation is vital for proper protein production. The specificity of this regulation is controlled by specific RNA motifs and regulatory proteins. Pumilio/fem-3 mRNA-binding factor (PUF) proteins are commonly used to regulate mRNA stability as well as translation. Here, we optimized a PUF protein target finder program to understand the natural diversity of RNA recognition by this family of proteins. ExonSuite is available to compile and run at https://github.com/dilanustek/ExonSuite. © 2013 Elsevier Ltd. All rights reserved.

* Corresponding author. Tel.: +1 641 781 9487. E-mail addresses: [email protected], [email protected] (D. Ustek).

0010-4825/$ - see front matter © 2013 Elsevier Ltd. All rights reserved. http://dx.doi.org/10.1016/j.compbiomed.2013.05.014

1. Introduction

Identifying the targets of PUF proteins offers a novel way to engineer mRNA stability and translation [1,2]. PUF proteins can be engineered to bind specific sequences in order to either enhance or suppress the inclusion of an exon in the mature mRNA [3,4]. An engineered PUF protein can thus be used to regulate mRNA stability as well as translation [5]. PUF proteins bind RNA and can be engineered to recognize specific eight-nucleotide sequences. Their effect can also be tuned through the composition of the C-terminal domain: proteins with C-terminal arginine–serine-rich domains tend to enhance inclusion of the exon, while proteins with glycine-rich domains tend to suppress it [5,6]. PUF proteins have been shown to affect the drug sensitivity of cancer cells [5], and researchers would like to extend their usability, which is the motivation for this work.

The purpose of this work is to describe ExonSuite, a program that finds optimal sequences for PUF protein targeting. The program first builds a frequency table of every 8-mer that PUF proteins can bind (an 8-nucleotide sequence whose fifth position is either C or U). It then searches for each 8-mer in the table across the whole genome and counts how many times it appears. For each target, the program returns the 8-mers with the fewest matches across the genome, and thus the fewest off-target effects, which we take to minimize the undesired effects of PUF protein binding.
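To make the size of the candidate set concrete, the following minimal Python sketch enumerates every 8-mer whose fifth position is C or U and initializes a frequency table for them. It is an illustration under stated assumptions, not the ExonSuite source: the names `valid_8mers` and `build_frequency_table` are hypothetical, and the RNA alphabet ACGU is assumed (DNA-style input would use T in place of U).

```python
from itertools import product

BASES = "ACGU"          # assumed RNA alphabet; use "ACGT" for DNA-style input
FIFTH_POSITION = "CU"   # constraint stated in the text: fifth base is C or U


def valid_8mers():
    """Yield every 8-mer whose fifth position is C or U (hypothetical helper)."""
    # Positions 1-4 and 6-8 are unconstrained; position 5 is restricted.
    choices = [BASES] * 4 + [FIFTH_POSITION] + [BASES] * 3
    for combo in product(*choices):
        yield "".join(combo)


def build_frequency_table():
    """Return a dict mapping each valid 8-mer to an initial count of 0."""
    return {kmer: 0 for kmer in valid_8mers()}


table = build_frequency_table()
print(len(table))  # 2 * 4**7 = 32768 candidate 8-mers
```

Because only one of the eight positions is constrained, the table holds 2 × 4^7 = 32768 entries, which is small enough to keep entirely in memory.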

2. Algorithm

The computational problem of choosing optimal PUF protein binding sites can be broken down into two parts: finding all the valid 8-mer binding sites, and calculating the associated score for 'off-target' matches, i.e. matches that are not within the exon of interest. Several strategies can be employed to solve each component. After experimentation, we settled on an efficient method that minimizes computational redundancy.

First, the skipsGen function (part of ExonSuite) breaks the input file up into its constituent exons, which are stored in a dictionary referred to as skips. Second, build frequencies generates a dictionary called frequencies containing every possible valid 8-mer. Third, the freqCount function (part of ExonSuite) is run on skips and frequencies to update the frequencies dictionary with the number of occurrences of every valid 8-mer across every exon in skips. Next, the function bestExon (part of ExonSuite) is called on skips and the updated frequencies dictionary; it in turn calls best8mer on each exon in skips, which identifies the 8-mer with the minimal off-target score within that exon. Finally, the results from bestExon are passed to fileMaker (part of ExonSuite) to generate the output file. Qualitatively, this method is much more efficient on large corpora of genomic information than the other approaches we tried.
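The pipeline described above can be sketched in Python as follows. This is a hedged illustration rather than the ExonSuite implementation: the function names mirror those in the text, but their bodies are assumptions. In particular, the sketch assumes a FASTA-like input (header lines starting with ">"), treats T as equivalent to U, counts 8-mers with a `Counter` instead of a pre-filled dictionary, defines the off-target score as the total count minus the count inside the exon of interest, and breaks ties alphabetically.

```python
from collections import Counter


def is_valid(kmer):
    """True for 8 unambiguous bases whose fifth base is C or U (T accepted as U)."""
    return len(kmer) == 8 and set(kmer) <= set("ACGTU") and kmer[4] in "CUT"


def skips_gen(lines):
    """Split FASTA-like lines into a dict of header -> sequence (assumed layout)."""
    skips, header = {}, None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            header = line[1:]
            skips[header] = ""
        elif header is not None and line:
            skips[header] += line.upper()
    return skips


def windows(seq):
    """Yield every valid 8-mer window of seq, including repeats."""
    for i in range(len(seq) - 7):
        kmer = seq[i:i + 8]
        if is_valid(kmer):
            yield kmer


def freq_count(skips):
    """Count each valid 8-mer once over all exons, so counts are reused later."""
    frequencies = Counter()
    for seq in skips.values():
        frequencies.update(windows(seq))
    return frequencies


def best_8mer(seq, frequencies):
    """Return (8-mer, off-target score) with the lowest score, or None."""
    local = Counter(windows(seq))
    if not local:
        return None  # the exon has no PUF-compatible site
    # Off-target score: occurrences everywhere minus occurrences in this exon.
    scored = ((kmer, frequencies[kmer] - n) for kmer, n in local.items())
    return min(scored, key=lambda pair: (pair[1], pair[0]))


def best_exon(skips, frequencies):
    """Apply best_8mer to every exon in skips."""
    return {h: best_8mer(seq, frequencies) for h, seq in skips.items()}


fasta = [">exon1", "AAAACAAAG", ">exon2", "AAAACAAA", ">exon3", "GGGG"]
skips = skips_gen(fasta)
result = best_exon(skips, freq_count(skips))
print(result)
```

In this toy input, exon1 and exon2 share the site AAAACAAA, so each sees one off-target match, while exon3 is too short to contain any 8-mer and yields no result.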


3. Output

We also designed a new file format for the program's results. The format consists of two columns: the left column contains the header information of each individual exon in the input file, and the right column, separated from it by a colon, contains the 8-mer that received the lowest off-target score together with that score (Supp. 1). The file can be read with spreadsheet programs such as Excel or with any text editor.

4. Testing

The algorithm was initially tested on a variety of generic sequences with easily predicted results, such as TTTTTTTTTTTATTTTTTTTTTT and similar simple sequences with clearly defined expected outcomes. Following the success of these initial tests, we began testing the algorithm on actual data obtained from the UCSC Genome Browser (http://genome.ucsc.edu), specifically the set of experimentally derived mRNA sequences from C. elegans. Initially, we focused on small subsets of the C. elegans mRNA, for which we could manually compare our results with the given data. We found that the output was consistent with expectations and, aside from minor difficulties with input formatting, we were able to locate every reported best 8-mer in the input file. We then ran the algorithm on larger subsets of the C. elegans mRNA and manually checked portions of the output (which quickly grew to be substantial in size) against the input files.

On a 2008 Apple MacBook with a 2.4 GHz Intel Core 2 Duo and 4 GB RAM running Mac OS X 10.6.8, the run time for the complete mRNA dump file was 2 min and 18 s. The output was formatted appropriately, with a best 8-mer returned for every exon that contained a PUF protein compatible sequence, along with the number of times the 8-mer appeared across the mRNA dump excluding the exon of interest. In this test, the largest off-target score was 371 and the lowest was 0. The average off-target score was 34.68, while the median was 35.91.

5. Results

To obtain experimental data, we ran the algorithm on the same MacBook as above, now using as inputs Mouse ESTs (Expressed Sequence Tags) (Supp. 2), Marmoset spliced ESTs (Supp. 3), and Turkey ESTs (Supp. 4). The ESTs consist of predicted genes obtained from cDNA transcripts, and are what the UNC lab will be focusing their computational efforts on.

The mouse EST data file was 33.9 MB in size, and ExonSuite returned a 6 MB output file with no known errors. The mean best off-target score was 81.862, with a standard deviation of 122.652. The minimum off-target score was 0, and the maximum off-target score was 3459.

The marmoset spliced EST data file was 2.7 MB in size, and ExonSuite returned a 45 KB output file with no known errors. The mean best off-target score was 0.607, with a standard deviation of 0.873. The minimum off-target score was 0, and the maximum off-target score was 4.

The turkey EST data file was 105.9 MB in size, and ExonSuite returned a 1.8 MB output file with no known errors. The mean best off-target score was 151.327, with a standard deviation of 255.665. The minimum off-target score was 1, and the maximum off-target score was 2274.
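As an illustration of the two-column output format from Section 3 and of the summary statistics reported in Sections 4 and 5, the sketch below writes a results file, reads it back, and summarizes the off-target scores. Several details are assumptions rather than facts from the paper: the 8-mer and its score are assumed to be separated by a space, the sample standard deviation is used (the text does not say which estimator was applied), and `read_scores` and `summarize` are hypothetical helpers. The name `file_maker` echoes fileMaker, but the body is not the ExonSuite code.

```python
import os
import statistics
import tempfile


def file_maker(results, path):
    """Write one 'header: 8mer score' line per exon that has a valid site."""
    with open(path, "w") as out:
        for header, best in results.items():
            if best is not None:
                kmer, score = best
                out.write(f"{header}: {kmer} {score}\n")


def read_scores(path):
    """Parse the output file back into a list of off-target scores."""
    scores = []
    with open(path) as handle:
        for line in handle:
            # Split on the last colon, since headers may contain colons too.
            _, right = line.rstrip("\n").rsplit(":", 1)
            _kmer, score = right.split()
            scores.append(int(score))
    return scores


def summarize(scores):
    """Return mean, sample standard deviation, minimum, and maximum."""
    return {
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
        "min": min(scores),
        "max": max(scores),
    }


# Made-up results for illustration only; these are not data from the paper.
results = {
    "exon1 chrI:100-200": ("AAAACAAA", 1),
    "exon2": ("GGGGUGGG", 3),
    "exon3": None,  # no PUF-compatible site, so no line is written
}
path = os.path.join(tempfile.mkdtemp(), "best_8mers.txt")
file_maker(results, path)
scores = read_scores(path)
summary = summarize(scores)
print(summary)
```

Splitting on the last colon keeps the parser robust when the exon header itself contains a colon, as genomic coordinates often do.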

6. Discussion

ExonSuite (Supp. 5) is a powerful program for identifying algorithmically optimal PUF protein binding sites. While illustrative, the output from ExonSuite should be interpreted mainly as a guide for further laboratory testing of hypotheses about the manipulation of gene expression with PUF proteins. The algorithm could be extended to detect other families of sequences that follow a regular pattern by modifying the way in which the frequencies dictionary is generated.

Conflict of interest statement

There is no conflict of interest.

Acknowledgments

We would like to thank Professor Sam Rebelsky of Grinnell College for his invaluable debugging assistance and coding advice. We would also like to thank Prof. Zefeng Wang from UNC Pharmacology.

Appendix A. Supporting information

Supplementary data associated with this article can be found in the online version at http://dx.doi.org/10.1016/j.compbiomed.2013.05.014.

References

[1] N.A. Faustino, T.A. Cooper, Pre-mRNA splicing and human disease, Genes Dev. 17 (2003) 419–437.
[2] M. Nissim-Rafinia, B. Kerem, The splicing machinery is a genetic modifier of disease severity, Trends Genet. 21 (9) (2006) 480–483.
[3] M.E. Rolish, Algorithms for simulating human pre-mRNA splicing decisions, Ph.D. thesis, Massachusetts Institute of Technology, 2005.
[4] G.S. Wang, T.A. Cooper, Splicing in disease: disruption of the splicing code and the decoding machinery, Nat. Rev. Genet. 8 (2007) 749–761.
[5] T.M.T. Hall, Y. Wang, C.G. Cheong, Z. Wang, Engineering splicing factors with designed specificities, Nat. Methods 6 (11) (2009) 825–830.
[6] G. Yeo, V. Tung, M. Mawson, Z. Wang, M.E. Rolish, C.B. Burge, Systematic identification and analysis of exonic splicing silencers, Cell 119 (6) (2004) 831–845.