PAFFT: A new homology search algorithm for third-generation sequencers

PAFFT: A new homology search algorithm for third-generation sequencers

YGENO-08769; No. of pages: 3; 4C: Genomics xxx (2015) xxx–xxx Contents lists available at ScienceDirect Genomics journal homepage: www.elsevier.com/...

260KB Sizes 1 Downloads 60 Views

YGENO-08769; No. of pages: 3; 4C: Genomics xxx (2015) xxx–xxx

Contents lists available at ScienceDirect

Genomics journal homepage: www.elsevier.com/locate/ygeno

Methods Paper

PAFFT: A new homology search algorithm for third-generation sequencers Kazuharu Misawa a,⁎,1, Ryo Ootsuki b,c a b c

Advanced Center for Computing and Communication, RIKEN, Hirosawa 2-1, Wako, Saitama 351-0198, Japan Faculty of Chemical and Biological Sciences, Japan Women's University, Tokyo, Japan Department of Natural Sciences, Faculty of Arts and Sciences, Komazawa University, 1-23-1 Komazawa, Setagaya-ku, Tokyo 154-8525, Japan

a r t i c l e

i n f o

Article history: Received 29 April 2015 Received in revised form 2 September 2015 Accepted 14 September 2015 Available online xxxx Keywords: Homology search Third-generation sequencer

a b s t r a c t DNA sequencers that can conduct real-time sequencing from a single polymerase molecule are known as thirdgeneration sequencers. Third-generation sequencers enable sequencing of reads that are several kilobases long. However, the raw data generated from third-generation sequencers are known to be error-prone. Because of sequencing errors, it is difficult to identify which genes are homologous to the reads obtained using third-generation sequencers. In this study, a new method for homology search algorithm, PAFFT, is developed. This method is the extension of the MAFFT algorithm which was used for multiple alignments. PAFFT detects global homology rather than local homology so that homologous regions can be detected even when the error rate of sequencing is high. PAFFT will boost application of third-generation sequencers. © 2015 Elsevier Inc. All rights reserved.

1. Introduction In 2009, real-time sequencing from a single polymerase molecule was developed [1]. DNA sequencers that can conduct real-time sequencing from a single polymerase molecule are known as thirdgeneration sequencers. Third-generation sequencers enable sequencing reads that are several kilobases long. However, the raw data generated from third-generation sequencers is known to be error-prone [2]. Because of sequencing errors, it is difficult to find genes that are homologous to the reads obtained by third-generation sequencers. In this study, a new method for homology search algorithm, PAFFT, is developed. The method developed in this study is based on the multiple alignments using fast Fourier transform (MAFFT) algorithm [3]. Altschul et al. [4] developed ideas regarding local alignment statistics for their basic local alignment search tool (BLAST) search algorithm. In the case of a BLAST search, the probability of the score follows the extreme value distribution [5]. The distributions of the highest scores of correlations are obtained by using random computationally generated sequences. These distributions can be used for testing the significance of the similarity between the reference genome and the reads obtained by third-generation sequencers. PAFFT detects global homology rather than local homology so that homologous regions can be detected even when the error rate of sequencing is high. Repetitive elements in the genome, because of the presence of similar or identical sequences, sometimes cause mapping error of the ⁎ Corresponding author. E-mail address: [email protected] (K. Misawa). Present address: Tohoku Medical Megabank Organization, Tohoku University, 2-1 Seiryo-machi, Aoba-ku, Sendai, Miyagi 980-8573, Japan. 1

short reads [6–8]. The chloroplast genome of higher plants contains the inverted repeat region (IR) [9]. To test the utility of PAFFT, the existence of the IRs of the chloroplast genome of the pink barren strawberry, Potentilla micrantha, was detected using PAFFT. 2. Methods 2.1. Score based on MAFFT In PAFFT, we use similar scoring method as MAFFT [3]. In this section, we introduce the scoring method. The correlation between sequences is calculated, and the highest-scoring pair of sequences is identified. Let us consider the case where the there are two sequences; sequence 1 and sequence 2. N and M are lengths of sequence 1 and sequence 2, respectively. Let us define the correlation c(k) between sequence 1 and sequence 2, with the positional lag of k sites defined as cðkÞ ¼

X

v1 ðnÞT v2 ðn þ kÞ;

ð1Þ

1 ≤ n ≤ N;1 ≤ nþk ≤ M

where 8 ð1; −1Þ when the nucleotide at n site of sequence j is T > > < ð−1; −1Þ when the nucleotide at n site of sequence j is C ; u j ðnÞ ¼ ð1; 1Þ when the nucleotide at n site of sequence j is A > > : ð−1; 1Þ when the nucleotide at n site of sequence j is G v j ðnÞ ¼ u j ðnÞ−

N 1X u ðlÞ: N l¼1 j

http://dx.doi.org/10.1016/j.ygeno.2015.09.005 0888-7543/© 2015 Elsevier Inc. All rights reserved.

Please cite this article as: K. Misawa, R. Ootsuki, Genomics (2015), http://dx.doi.org/10.1016/j.ygeno.2015.09.005

2

K. Misawa, R. Ootsuki / Genomics xxx (2015) xxx–xxx

2.2. Probability density of correlation

Table 1 Type I error rate.

Let us consider the probability density of correlation. When the GC content of sequences is h, the variance of v1 (n)v2 (n + k)T, V[v1 (n)v2 (n + k)T] is obtained by h

T

V v1 ðnÞv2 ðn þ kÞ

i

4

¼ 2−ð2h−1Þ :

ð2Þ

The maximum value of V[v1 (n)v2 (n + k)T] is 2 when h = 50%. The central limit theorem states that the mean and sum of a large enough random sample from an arbitrary distribution have an approximately normal distribution. Thus, c(k) follows a normal distribution for which the variance is less than 2. Let us denote the number of possible positional lags by L. The probability distribution of the extreme value of c(k) can be obtained as   h n x−moi cðkÞ P pffiffiffi Nx ¼ exp − exp − : c 2

ð3Þ

Where m and c are parameters of the Gumbel distribution that pffiffiffiffiffiffi are obtained by expð12 m2 Þm 2π ¼ L and c ¼ m2mþ1 [10]. In this study, m is obtained by the Newton–Raphson method [11]. It is well known that PacBio sequencers cause sequence errors [2]. Thus, random nucleotides are put into the reference sequence. 2.3. Computer simulation To obtain appropriate threshold value, computer simulations were conducted. To produce a reference set, we generated 100,000 pairs of random DNA sequences with lengths of 1000, 2000, 5000, and 10,000 bp. We set the GC content of the generated sequences to 40% [8,12]. To estimate type I error, we simulated totally random sequences. To estimate type II error, we simulated a pair of artificial homologous sequences. We generate one random DNA sequences, then we copied it. We chose one of the pair sequences. We substituted by random nucleotides with error rates of 0%, 10%, and 20%. Because third-generation sequencers tend to reveal erroneous insertions [13], sequences in the simulation were generated, as they contain 20% insertions. 2.4. Data analysis As shown in Fig. 1, two identical copies are located in the IR regions. A and B regions in Fig. 1 are located outside of the IR region. In A region, there are ribosomal proteins S3, L22/L17e and S19. In B region, there is matK gene. In IR region, there are genes that encode ycf2 and ribosomal protein L2. As a references, DNA sequences of ycf2, ribosomal proteins

Length

1000 2000 5000 10,000

Significance level p b 0.01

p b 0.001

p b 0.0001

p b 0.00001

0.06137 0.07969 0.01944 0.02109

0.00737 0.00993 0.00166 0.00184

0.00062 0.00113 0.00008 0.00017

0.00006 0.0001 0 0.00002

L2, S3, L22/L17e, S19, mat K were obtained from the chloroplast genome of wild strawberry, Fragaria vesca (GenBank: KC507756). The homology search of the reads in the reference set was conducted using the new program and blastn (5) version 2.2.31. PacBio reads of P. micrantha ERP003833 were used [14]. The sequence reads were categorized with respect to their lengths. The number of reads of categories 1, 2, 3, 4, and 5 are 8688, 10,573, 6756, 2294 and 327, respectively. The significance level of PAFFT was set as 0.1%. We used the default parameter set for blastn, so that the E-value was 10. Thus, setting of PAFFT was more conservative than that of blast except category 2. When the Escherichia coli genome was read by PacBioRS, the reads contained 10.7% insertions, 4.3% deletions, and 0.9% substitution errors [13]. Thus, 14% of random nucleotides were put into the reference sequence. 3. Results 3.1. Computer simulation Table 1 shows the type I error rates obtained by computer simulations. Threshold values for given significance levels were obtained by using the extreme-value distribution (see Materials and Methods). This result indicates that type I error rates are higher than expected values obtained when the reference sequences are short. Table 2 shows the type II errors of homology search. Even when the error rate of the sequencer is 20%, PAFFT can find homology between sequences. 3.2. Data analysis Table 3 shows the number of reads taken from P. micrantha that are homologous to the reference sequence obtained from wild strawberry, F. vesca. This result indicates that the chloroplast genome of P. micrantha has two inverted repeat (IR) regions. Table 3 also shows the comparison between PAFFT and blastn. The results show that blastn found more homologous sequences than PAFFT when the read is Table 2 Type II error rates. Length

Fig. 1. Chloroplast genome of Fragaria vesca. fIRs are inverted repeats. A and B are adjacent regions of fIRs.

Significance level p b 0.01

p b 0.001

p b 0.0001

p b 0.00001

No error 1000 2000 5000 10,000

0.10447 0.07332 0.13231 0.10812

0.26268 0.20705 0.30422 0.25845

0.45166 0.38157 0.49325 0.43913

0.62448 0.55564 0.66086 0.61202

Error 10% 1000 2000 5000 10,000

0.17862 0.13169 0.22735 0.19402

0.38183 0.31393 0.43716 0.39022

0.58365 0.51021 0.63101 0.58296

0.74409 0.68044 0.77919 0.74076

Error 20% 1000 2000 5000 10,000

0.28183 0.22713 0.35905 0.31756

0.51993 0.45539 0.58751 0.54495

0.71585 0.65744 0.76492 0.72823

0.84737 0.80279 0.87786 0.85385

Please cite this article as: K. Misawa, R. Ootsuki, Genomics (2015), http://dx.doi.org/10.1016/j.ygeno.2015.09.005

K. Misawa, R. Ootsuki / Genomics xxx (2015) xxx–xxx

All calculations were performed using the RIKEN Integrated Cluster of Clusters (RICC).

Table 3 Number of reads that are homologous to the reference sequence. Software

Category 1

Category 2

3

Category 3

Category 4

Category 5

Number of hits on both of A and fIR blastn 4 75 PAFFT 0 6

109 439

68 310

22 93

Number of hits on both of B and fIR blastn 1 37 PAFFT 0 2

66 12

70 264

14 93

Acknowledgment

Sequence reads having lengths of b1 k bp are categorized into category 1. Those having lengths of ≥1 k bp but b2 k bp were categorized into category 2. Those having lengths of ≥2 k bp but b4 k bp were categorized into category 3. Those having lengths of ≥4 k bp but b8 k bp were categorized into category 4. Those having lengths of ≥ 8k bp were categorized into category 5.

shorter than 4 k bp. When the reads are longer than 4 k bp, PAFFT found more homologous sequences than blastn. 4. Discussion We developed PAFFT to find homologies between a reference genome and the reads obtained by third-generation sequencer. PAFFT detects global homology rather than local homology so that homologous regions can be detected even when the error rate of sequencing is high. In this paper, method for obtaining threshold values for PAFFT when the significance levels were given, by assuming the distribution of the maximum value of correlation follows the extreme value distribution of type I errors [11]. Computer simulations indicate that type I error rates are higher than expected values obtained especially when the reference sequences are short. When the expected type I error rates were calculated, we assume the distribution of the maximum value of score, c(k), follows the Gumbel distribution. This assumption does not hold true when the sequence length is short, because the central limit theorem states that the mean and sum of a large enough random sample from an arbitrary distribution have an approximately normal distribution. It is better to exclude short reads when homology search by PAFFT is conducted. In data analysis on P. micrantha and F. vesca. These species had diverged 24 Ma ago [15]. Sequence identity of ycf2 genes between these species is 99%. When the reads are long, PAFFT better than blastn. According to the results shown above, the existence of the repeat elements can be confirmed using PAFFT without assembly. PAFFT will be a useful tool for genome sequencing. Competing interests

We thank Dr. Marcelo Bento Soares and anonymous reviewers for valuable suggestions and comments. We also thank Dr. Ryutaro Himeno for his encouragement and support. This work was supported by JSPS KAKENHI Grant Number 26650146. References [1] J. Eid, A. Fehr, J. Gray, K. Luong, J. Lyle, G. Otto, P. Peluso, D. Rank, P. Baybayan, B. Bettman, A. Bibillo, K. Bjornson, B. Chaudhuri, F. Christians, R. Cicero, S. Clark, R. Dalal, A. Dewinter, J. Dixon, M. Foquet, A. Gaertner, P. Hardenbol, C. Heiner, K. Hester, D. Holden, G. Kearns, X. Kong, R. Kuse, Y. Lacroix, S. Lin, P. Lundquist, C. Ma, P. Marks, M. Maxham, D. Murphy, I. Park, T. Pham, M. Phillips, J. Roy, R. Sebra, G. Shen, J. Sorenson, A. Tomaney, K. Travers, M. Trulson, J. Vieceli, J. Wegener, D. Wu, A. Yang, D. Zaccarin, P. Zhao, F. Zhong, J. Korlach, S. Turner, Real-time DNA sequencing from single polymerase molecules, Science 323 (2009) 133–138. [2] M.A. Quail, M. Smith, P. Coupland, T.D. Otto, S.R. Harris, T.R. Connor, A. Bertoni, H.P. Swerdlow, Y. Gu, A tale of three next generation sequencing platforms: comparison of Ion Torrent, Pacific Biosciences and Illumina MiSeq sequencers, BMC Genomics 13 (2012) 341. [3] K. Katoh, K. Misawa, K. Kuma, T. Miyata, MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform, Nucleic Acids Res. 30 (2002) 3059–3066. [4] S.F. Altschul, W. Gish, W. Miller, E.W. Myers, D.J. Lipman, Basic local alignment search tool, J. Mol. Biol. 215 (1990) 403–410. [5] S.F. Altschul, R. Bundschuh, R. Olsen, T. Hwa, The estimation of statistical parameters for local alignment score distributions, Nucleic Acids Res. 29 (2001) 351–361. [6] W. Li, J. Freudenberg, Mappability and read length, Front. Genet. 5 (2014) 381. [7] T.J. Treangen, S.L. Salzberg, Repetitive DNA and next-generation sequencing: computational challenges and solutions, Nat. Rev. Genet. 13 (2012) 36–46. [8] K. Misawa, RF: a method for filtering short reads with tandem repeats for genome mapping, Genomics 102 (2013) 35–37. [9] R. Kolodner, K.K. Tewari, Inverted repeats in chloroplast DNA from higher plants, Proc. Natl. Acad. Sci. U. S. A. 76 (1979) 41–45. [10] R.A. Fisher, L.H.C. Tippett, Limiting forms of the frequency distribution of the largest or smallest member of a sample, Proc Cambridge Philos Soc 24 (1928) 180–190. [11] W.H. Press, S.A. Teukolsky, W.T. Vetterling, B.P. Flannery, Numerical Recipes in C, The Art of Scientific Computing, second ed. Cambridge University Press, Cambridge, 1999. [12] K. Misawa, A codon substitution model that incorporates the effect of the GC contents, the gene density and the density of CpG islands of human chromosomes, BMC Genomics 12 (2011) 397. [13] M.J. Chaisson, G. Tesler, Mapping single molecule sequencing reads using basic local alignment with successive refinement (BLASR): application and theory, BMC Bioinformatics 13 (2012) 238. [14] M. Ferrarini, M. Moretto, J.A. Ward, N. Surbanovski, V. Stevanovic, L. Giongo, R. Viola, D. Cavalieri, R. Velasco, A. Cestaro, D.J. Sargent, An evaluation of the PacBio RS platform for sequencing and de novo assembly of a chloroplast genome, BMC Genomics 14 (2013) 670. [15] W. Njuguna, A. Liston, R. Cronn, T.L. Ashman, N. Bassil, Insights into phylogeny, sex function and age of Fragaria based on whole chloroplast genome sequencing, Mol. Phylogenet. Evol. 66 (2013) 17–29.

The authors declare that they have no competing interests.

Please cite this article as: K. Misawa, R. Ootsuki, Genomics (2015), http://dx.doi.org/10.1016/j.ygeno.2015.09.005