Fundamental asymmetry of insertions and deletions in genomes size evolution


Journal of Theoretical Biology 482 (2019) 109983

Contents lists available at ScienceDirect

Journal of Theoretical Biology journal homepage: www.elsevier.com/locate/jtb

Yang He a, Suyan Tian b,∗, Pu Tian c,∗

a School of Life Sciences, Jilin University Changchun, 2699 Qianjin Street, China 130012
b Division of Clinical Epidemiology, First Hospital of The Jilin University, 71 Xinmin Street, Changchun, China, 130021
c School of Life Sciences and MOE Key laboratory of Molecular Enzymology and Engineering, Jilin University 2699 Qianjin Street, Changchun, China 130012

Article info

Article history: Received 21 March 2019; Revised 18 August 2019; Accepted 21 August 2019; Available online 22 August 2019

Keywords: Indels; Random biological sequence; Thought perfect genome; C-Value enigma

Abstract

The origin of large genomes that underlies the long-standing “C-value enigma” is only partially explained by selfish DNA. We investigated insertions and deletions (indels) of nucleotides and discussed their relevance to the size evolution of random biological sequences (RBS) and genomes. Using a probabilistic model of RBS based on the size evolution of expandable sites in a thought perfect genome, we found that insertion bias engenders an exponential increase of average RBS size. When combined with existing large genome segments that are not subject to selection pressure (e.g. selfish DNA), such insertion bias results in explosive expansion of genomes, and therefore helps explain the “C-value enigma” alongside selfish DNA. This increase of RBS size is caused by the fundamental asymmetry of indels: insertions result in more available insertion sites, whereas deletions result in fewer deletable nucleotides. In qualitative agreement with the size distribution of known genomes, the tails of RBS size distributions decay exponentially, with larger RBS segments being less probable. Unsurprisingly, a slight deletion bias (higher deletion probability) results in a slow decrease of average RBS size and may lead to the eventual vanishing of RBS. Contrary to intuition, strictly balanced insertion and deletion results in a linearly increasing rather than fixed RBS size. Nonetheless, this slow linear increase of average RBS size with time is small in magnitude, is consequently not influential in genome size evolution, and is certainly not a major contributor to the “C-value enigma”. Our model suggests that insertion bias of nucleotides may provide an explanation for large genomes that is complementary to selfish DNA. The fundamental indel asymmetry applies to all forms of genomic insertions and deletions. A long-lasting exponential increase of genome size imposes energy and material requirements that are impossible to sustain. We therefore concluded that if an explosively accelerating expansion caused by significant effective insertion bias occurred in any surviving species, it must have occurred sporadically. Our model also provides an explanation for the observed proportional evolution of genome size.

© 2019 Published by Elsevier Ltd.

1. Introduction

The idea of junk DNA traces back to 1972 in published form and was widely accepted until the recent publication of a series of papers by the ENCODE consortium (http://www.encodeproject.org). The summary ENCODE paper (Dunham et al., 2012) declared that about 80% of the human genome serves some purpose, biochemically speaking. This conclusion was taken out of its experimental and biological context in a news report declaring that the idea of “junk DNA” had become history (Elizabeth, 2012).



∗ Corresponding authors. E-mail addresses: [email protected] (S. Tian), [email protected] (P. Tian).

https://doi.org/10.1016/j.jtbi.2019.08.014 0022-5193/© 2019 Published by Elsevier Ltd.

Unsurprisingly, critiques followed (Eddy, 2012; Doolittle, 2013; Eddy, 2013; Graur et al., 2013; Niu and Jiang, 2013; Palazzo and Gregory, 2014; Palazzo and Lee, 2015). The analysis of the definition of functional elements by Doolittle (2013) and the random-sequence negative control proposed by Eddy (2013) provided great insight into further directions to take in clarifying the controversy. Niu and Jiang (2013) suggested that knockouts are the ultimate robust way of distinguishing functional elements from junk DNA. Essentially, debates regarding the ENCODE publications hinged on the definition of “functional elements” (Doolittle et al., 2014; Kellis et al., 2014; Germain et al., 2014). Kellis et al. (2014) stated that one should integrate information from biochemical, genetic and evolutionary methods to accurately describe the functions of the DNA segments concerned. Both Doolittle et al. (2014) and Germain et al. (2014) emphasized the importance of the “causal role” in defining function.

Despite observed correlations between genome size and various physiological and environmental aspects in a limited number of species (Gregory, 2005; Bennett and Leitch, 2005), the “C-value enigma” (Petrov, 2001) has been one of the central issues concerning genome size evolution and remains to be fully resolved. While selfish DNA (represented by transposable elements (TEs)), proposed by Orgel (1980) and by Doolittle and Sapienza (1980), may explain a significant part of the large genome size in some species (e.g. human), a considerable part is not attributable to TEs (at least ∼35% in human (Robicheau et al., 2017)). It was found that neither sexual/asexual reproduction nor TEs account for genome size differences among 30 investigated evening primrose species (Ågren et al., 2015). The genome balance hypothesis (Freeling et al., 2015) may explain how junk DNA peacefully coexists with essential and important functional parts of genomes, but not how it came into being in the first place. The existence, expansion or shrinkage of random biological sequences (RBS, defined below) inevitably changes the size of the harboring genome and the relative weights of functional and non-functional sequences. Therefore, understanding RBS evolution may well help in revealing the extent of functional sequences, and may lead to new insight into the long-standing “C-value enigma”.

Suppose there was an ideal organism in which every nucleotide of its perfect genome was essential (structurally, functionally or for regulation) and an insertion at any intra-functional-segment site was lethal. We use it as the starting point of a thought experiment. As the perfect genome starts to replicate, there are certain probabilities of mutation, insertion and deletion. By definition of the perfect genome, any deletion will result in a non-viable offspring that does not survive natural selection.
As far as the size of a perfect genome is concerned, mutation may be neglected as it does not contribute to size change. The only possible size change is an insertion at one of the expandable sites of the perfect genome. These expandable sites, by definition of the perfect genome, have to lie between two functional segments (genes or structurally/functionally regulatory sequences) that do not require immediate physical proximity. However, once an insertion has happened, deletion of the inserted random sequence is neutral during the next round of replication. Here random biological sequences (RBS) are defined as sequences that are generated in the above-mentioned way and are functionally neutral in terms of selection (except for the engendered energetic, space and replication-time costs) at the time of generation. Each expandable site is a potential starting point for the generation of an RBS, which may grow arbitrarily large before possibly splitting into multiple expandable sites (when one or more of its segments evolve into essential functional elements) or being deleted. It is important to note that RBS are not genuinely random sequences, as all RBS are made by the DNA polymerase system (DPS), which is not a random sequence generating machine. As a corollary, RBS are species specific, as each species has its own DPS. Additionally, the DPS of a given species may evolve, and therefore the randomness (or biases) of RBS for a given species might also be time dependent on geological time scales.

In this study, we wanted to answer the following questions. Firstly, are RBS essential parts of all genomes? If the answer is yes, what are their size distributions? How does the relative importance of the insertion and deletion rates impact the size evolution of RBS?
There have been some efforts to infer indels from analyses of both coding sequences and pseudogenes (de Jong and Ryden, 1981; Graur et al., 1989; Petrov et al., 1996; Ophir and Graur, 1997), and it was consistently found that indels exist widely, with deletion rates generally being larger than insertion rates. Based on these findings, Petrov et al. proposed the equilibrium genome size model (Petrov, 2002). Gregory (2003; 2004) discussed many deficiencies of this model, and concluded that while important, DNA loss via small deletion bias is not sufficient to explain large genome size differences. Kapusta et al. found that deletions of various-sized segments are responsible for genome size differences in birds and mammals (Kapusta et al., 2017). However, not much attention has been paid to the role of insertion bias, since it has not been consistently observed in analyses of sequenced genomes.

A simple mathematical model was developed in this study, and some interesting results were inferred from it. Firstly, our model suggests that a small insertion bias (higher insertion probability) of even single nucleotides may lead to arbitrarily large genomes, and thus provides a possible explanation of the C-value enigma that is complementary to selfish DNA. Secondly, the tails of RBS length distributions exhibit exponential decay, with shorter RBS being much more likely than longer ones; this agrees qualitatively with the distribution of observed genome sizes (Oliver et al., 2007). The explosively accelerating growth of average RBS size under insertion bias is essentially caused by the fundamental asymmetry of insertions and deletions: the former result in more sites for further insertion and the latter result in fewer deletable nucleotides. Such asymmetry applies to any form of effective size variation of DNA segments that are not subject to selection pressure, and is therefore likely to dictate the global trend of genome size evolution. The observed proportional model of evolution (Oliver et al., 2007) may also be explained by the fundamental asymmetry of insertion and deletion. Additionally, contrary to intuition, strictly balanced insertion and deletion results in a linearly increasing rather than fixed-size RBS, and consequently a linearly growing harboring genome. Again, this counter-intuitive phenomenon is caused by the fundamental asymmetry of insertions and deletions.
The models presented in this work were coded in the C++ programming language to carry out the computations; source code is available upon request.

2. Results

2.1. Evolution of average size for random biological sequences

While it remains a mystery exactly how life originated, we do know that at some point of evolution and thereafter, DPS has been responsible for the relay of genetic information. As demonstrated by analyses of many sequenced genomes (de Jong and Ryden, 1981; Graur et al., 1989; Petrov et al., 1996; Ophir and Graur, 1997), both insertions and deletions are possible events during such informational relay. For simplicity, we hypothesize that the probability of a fixed insertion (Pi) at an expandable site between any two current nucleotides and the probability of a fixed deletion (Pd) of any current nucleotide are sequence independent. Correspondingly, the probabilities of any site/nucleotide staying without insertion/deletion are 1 − Pi and 1 − Pd respectively. Additionally, simultaneous multiple insertions at one site are not allowed (i.e., between each immediately neighboring pair of nucleotides, insertion of only one nucleotide is allowed in each round of replication). With these two assumptions, and denoting the average length at the nth generation as L(n), a segment of length L(n − 1) offers 1 + L(n − 1) insertion sites and L(n − 1) deletable nucleotides, so one arrives at the following recursion:

$$L(n) - L(n-1) = (1 + L(n-1))\,P_i - L(n-1)\,P_d \qquad (1)$$

Given the initial condition of zero starting length of an expandable site, the solution to this recursion is:

$$L(n) = P_i \sum_{k=0}^{n-1} (1 + P_i - P_d)^k \qquad (2)$$

When Pi = Pd = P, the solution reduces to:

$$L(n) = nP \qquad (3)$$
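As a numerical check of Eqs. (1)–(3), the recursion can be iterated directly and compared with the geometric sum of Eq. (2) written in closed form. The following Python sketch is illustrative only: the function names `average_length` and `closed_form` and the parameter values in the usage loop are assumptions of this example and are not taken from the authors' C++ code.

```python
def average_length(n, p_ins, p_del, start=0.0):
    """Iterate Eq. (1) for n generations, starting from length `start`."""
    length = start
    for _ in range(n):
        # A segment of average length L offers 1 + L insertion sites
        # and L deletable nucleotides.
        length += (1.0 + length) * p_ins - length * p_del
    return length


def closed_form(n, p_ins, p_del):
    """Evaluate the geometric sum of Eq. (2) in closed form (zero start)."""
    delta = p_ins - p_del
    if delta == 0.0:
        return n * p_ins  # Eq. (3): strictly balanced indels
    return p_ins * ((1.0 + delta) ** n - 1.0) / delta


if __name__ == "__main__":
    # Hypothetical parameter sets: balanced, deletion-biased, insertion-biased.
    for p_ins, p_del in [(1e-3, 1e-3), (1e-3, 1.1e-3), (1e-3, 0.9e-3)]:
        print(p_ins, p_del,
              average_length(20000, p_ins, p_del),
              closed_form(20000, p_ins, p_del))
```

Iterating the recursion and evaluating the closed form should agree to within floating-point error for any choice of Pi and Pd, which makes the pair a convenient sanity check before exploring other parameter values.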

Since P is a small number, with strictly balanced indels RBS sizes increase on average, but insignificantly. For example, with P = 0.000001, L(1,000,000,000,000) = 1,000,000, which is not a significant number compared with the huge genomes observed to date. Considering that 0.000001 is quite a large number even for the mutation rate (which is supposed to be larger than indel rates) of many DPS, we may conclude that the impact of the linear increase on genome size under strictly balanced indels is trivial.

When Pd > Pi, the finite sum in Eq. (2) evaluates to the following:

$$L(n) = \frac{P_i \left(1 - \left(1 - (P_d - P_i)\right)^{n}\right)}{P_d - P_i} \qquad (4)$$

Since both Pi and Pd are small positive numbers, 0 < Pd − Pi < 1, so (1 − (Pd − Pi))^n vanishes when n is large and we have:

$$L(n) \approx \frac{P_i}{P_d - P_i} \qquad (5)$$

The average length therefore saturates at a finite value set by the ratio of Pi to the excess deletion probability Pd − Pi instead of growing with n, and by Eq. (1) any larger pre-existing RBS decays towards the same value. This result indicates that, as one would intuitively expect, deletion bias effectively prevents non-essential RBS from accumulating over time.

When Pi > Pd, let ΔP = Pi − Pd; we have:

$$L(n) = P_i \sum_{k=0}^{n-1} (1 + \Delta P)^k = \frac{P_i}{\Delta P}\left((1 + \Delta P)^n - 1\right) \qquad (6)$$

In this scenario, the average length of RBS segments exhibits exponential growth with the number of generations. Given a sufficient number of generations, a small bias of even single-nucleotide insertions may result in excessively large genomes. The above result is qualitatively different from the widely utilized random walk model (e.g. polymerization), in which a biased walk does not cause an exponential increase of the average distance (chain length for polymerization) from the origin. This fundamental difference is due to the fact that in random walk models (like polymerization), an increase/decrease of distance may only occur at the distal ends, while indels may occur anywhere within an RBS segment, which started from an empty expandable site. Larger insertions are expected to significantly accelerate this process, as each insertion introduces more sites for further insertions; the compounding effect would thus be amplified, with the magnitude of amplification decided by the size of the inserted segments.

Fig. 1 shows the average size evolution under various insertion biases (Eq. (6)). It is evident that for the various given (relatively large) Pi and ΔP, the initial priming stage (slow growth) of an RBS segment is rather long. However, size expansion becomes explosive later on, when there are sufficiently many sites for insertions. Nonetheless, we know that there are large DNA segments that are not subject to selection pressure (e.g. selfish DNA), which may provide effective priming for explosive growth of RBS under insertion bias. Additionally, it is obvious that long-lasting (on geological time scales) insertion bias would result in excessively large genomes that cannot be accommodated by any cell, and therefore insertion bias may only exist sporadically in any surviving species.

2.2. Size distributions for random biological sequences

Beyond the average length, we are also interested in the distributions of RBS length for given indel biases. The probability Pn(r) of an RBS having length r after n rounds of replication (starting from an empty expandable site) may be recursively expressed as

$$
\begin{aligned}
P_n(r) ={}& \sum_{k=1}^{\lfloor (r+1)/2 \rfloor} \sum_{j=0}^{r-k} P_{n-1}(r-k) \binom{r-k+1}{k+j} P_i^{\,k+j} (1-P_i)^{r-2k-j+1} \binom{r-k}{j} P_d^{\,j} (1-P_d)^{r-k-j} \\
&+ \sum_{k=1}^{2^{(n-1)}-r} \sum_{j=0}^{r} \binom{r+k}{k+j} P_{n-1}(r+k)\, P_d^{\,k+j} (1-P_d)^{r-j} \binom{r+k+1}{j} P_i^{\,j} (1-P_i)^{r+k-j+1} \\
&+ \sum_{j=0}^{r} \binom{r}{j} P_{n-1}(r)\, P_d^{\,j} (1-P_d)^{r-j} \binom{r+1}{j} P_i^{\,j} (1-P_i)^{r-j+1},
\end{aligned}
\qquad 0 \le r-k,\quad r+k \le 2^{(n-1)} \quad \text{and} \quad n > 1 \qquad (7)
$$

where each binomial coefficient counts the ways of choosing the affected sites or nucleotides and is taken as zero when the lower index exceeds the upper one, with boundary conditions:

$$
P_n(r) =
\begin{cases}
1 - P_i & \text{if } r = 0 \text{ and } n = 1 \\
P_i & \text{if } r = 1 \text{ and } n = 1
\end{cases}
$$

The first term represents the scenario where a net length increase of k nucleotides results from k + j insertions and j deletions; the second term represents the scenario where a net length decrease of k nucleotides results from k + j deletions and j insertions; and the third term counts the situations where no net length change occurs because of equal numbers (j) of insertions and deletions. Analytical solutions for the distribution of RBS are not readily available. Even numerical solution is difficult when r, k and j are large. However, the trends and distributions embodied by small r, k and j are numerically tractable and are qualitatively equally informative. For RBS segments of small sizes (not significantly larger than kilobases) and small Pi and Pd, the probability of multiple simultaneous and independent fixed insertions/deletions is expected to become increasingly smaller with an increasing number of simultaneous insertions/deletions. So Eq. (7) may be truncated at first (k = 1), second (k = 2), third (k = 3) or higher order to account only for single, double, triple or more multiple simultaneous and independent insertions/deletions during one round of replication. For the first-order truncation (k = 1), in which a length-(r − 1) segment offers r insertion sites and a length-(r + 1) segment offers r + 1 deletable nucleotides, Eq. (7) may be simplified as follows:

$$
P_n(r) = r\,P_{n-1}(r-1)\,P_i (1-P_i)^{r-1} (1-P_d)^{r-1} + (r+1)\,P_{n-1}(r+1)\,P_d (1-P_d)^{r} (1-P_i)^{r+2} + P_{n-1}(r)\,(1-P_i)^{r+1} (1-P_d)^{r}, \qquad 0 < r \ \text{and} \ n > 1 \qquad (8)
$$
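To make the first-order truncation concrete, the following Python sketch propagates the length distribution one generation at a time, allowing at most one insertion or one deletion per generation, and lets the total retained probability be inspected. It is a minimal illustration under stated assumptions, not the authors' C++ implementation: the function names `step_first_order` and `evolve` are hypothetical, a segment of length m is taken to offer m + 1 insertion sites (as in Eq. (1)), the factor for all other sites and nucleotides staying unchanged is kept explicitly, and lengths down to r = 0 are included.

```python
def step_first_order(prev, p_ins, p_del):
    """Advance the length distribution by one generation.

    prev[m] is the probability that the RBS has length m. Only a single
    insertion, a single deletion, or no change is counted, so the returned
    probabilities sum to slightly less than sum(prev).
    """
    size = len(prev) + 1  # the maximum length can grow by one
    new = [0.0] * size
    for r in range(size):
        total = 0.0
        if r >= 1:
            # One insertion at any of the r sites of a length r-1 segment.
            total += (r * prev[r - 1] * p_ins * (1 - p_ins) ** (r - 1)
                      * (1 - p_del) ** (r - 1))
        if r + 1 < len(prev):
            # One deletion of any of the r+1 nucleotides of a longer segment.
            total += ((r + 1) * prev[r + 1] * p_del * (1 - p_del) ** r
                      * (1 - p_ins) ** (r + 2))
        if r < len(prev):
            # No insertion at r+1 sites and no deletion of r nucleotides.
            total += prev[r] * (1 - p_ins) ** (r + 1) * (1 - p_del) ** r
        new[r] = total
    return new


def evolve(n, p_ins, p_del):
    """Return the truncated distribution after n generations from an empty site."""
    dist = [1.0]  # generation 0: the expandable site is empty
    for _ in range(n):
        dist = step_first_order(dist, p_ins, p_del)
    return dist
```

For example, `sum(evolve(50, 1e-3, 9e-4))` gives the probability mass retained after 50 generations; its shortfall from 1 is the accumulated truncation error, which grows as longer segments become populated.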

As the average length of RBS increases, the error caused by truncation grows rapidly. Utilizing the largest possible k (the number of available insertion sites in the single-nucleotide insertion case) at each iteration, however, renders the total computational cost prohibitive within a small number of iterations. To make sure that our truncations are computationally tractable on the one hand and have acceptable accuracy on the other for the tested parameters (Pi, ΔP = Pi − Pd), we monitored the evolution of the following summation:

$$P_{sum} = \sum_{r=1}^{r_{max}} P(r) \qquad (9)$$

rmax is the largest r reached at a given iteration with the given truncation order k. By definition, Psum should always be 1.0 at each iteration if there were no truncation error. As the accumulation of truncation error becomes severe, Psum drops rapidly, as shown in Fig. 2. We therefore control the truncation error by only utilizing data numerically generated according to Eq. (7) with Psum > 0.99 (Eq. (9)). To qualitatively show the distribution of RBS length at the early growth stage, we chose Pi = 0.001, ΔP = Pi − Pd = 0.0001 and k = 150 and plotted the corresponding length distributions in Fig. 3. The tails of all the distributions exhibit exponential decay, in qualitative agreement with the observed genome size distributions (Oliver et al., 2007). Meanwhile, with an increasing number of generations, the length range increases dramatically. This is expected from the fundamental asymmetry of indels, as longer RBS grow faster. The observation is consistent with, and effectively explains, the proportional model of evolution (Oliver et al., 2007), which summarizes the observed genome size distribution but did not propose a theoretical model for explanation. Additionally, the peak of the distribution starts at zero length (Fig. 3a) and slowly moves to larger values (Fig. 3c–f) with an increasing number of generations. Unfortunately, with further iterations Psum started to drop rapidly and we were not able to solve for the length distributions at later (and more explosive) growth stages. Nonetheless, the qualitative trend observed in Fig. 3 should persist. Since these parameters (Pi and Pd) are exaggerated (Senra et al., 2018), these length distributions are only qualitatively informative. We emphasize that even with single-nucleotide indels, numerical solution of Eq. (7) with no truncation rapidly becomes computationally prohibitive with increasing RBS size. The net effect of truncation is to reduce the rate of average size expansion of RBS. Apparently, for larger RBS, the effect of truncation becomes more prominent. Therefore, in terms of RBS size distribution, truncation would result in narrowing the size range of RBS.

Fig. 1. Evolution of average size (L(n)) as a function of the number of generations (n) for RBS with different single-nucleotide insertion probabilities (Pi) and corresponding insertion-deletion probability differences (ΔP = Pi − Pd). Both axes are shown on a log scale. Purple: Pi = 10^−3, ΔP = 10^−4; Red: Pi = 10^−7, ΔP = 10^−8; Green: Pi = 10^−7, ΔP = 10^−9; Blue: Pi = 10^−7, ΔP = 10^−10. The filled circles on the purple line indicate where the distributions of RBS sizes are shown in Fig. 3 (Cyan circle: n = 30000, Fig. 3a; Blue circle: n = 40000, Fig. 3b; Black circle: n = 50000, Fig. 3c; Red circle: n = 60000, Fig. 3d). (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

2.3. Extension to indels of nucleotide segments

As one may imagine, and as indicated by frequency and size analyses of indels in many species (de Jong and Ryden, 1981; Graur et al., 1989; Petrov et al., 1996; Ophir and Graur, 1997; Nóbrega et al., 2004; Kapusta et al., 2017), indels come in a wide variety of sizes and rates in real genomes. For combinations of non-interacting indels of various sizes and probabilities, effective insertion bias may be defined as:

$$\sum_{il=1}^{il_{max}} P_{il}\, N_{il}^{site} > \sum_{dl=1}^{dl_{max}} P_{dl}\, N_{dl}^{seg} \qquad (10)$$

where il_max is the maximum insertion length, P_il is the probability of insertion for nucleotide segments of length il, and N_il^site is the number of available sites for insertion of segments of length il. Similarly, dl_max is the maximum deletion length, P_dl is the probability of deleting a nucleotide segment of length dl, and N_dl^seg is the number of available segments of length dl for deletion. In the same way, effective deletion bias may be defined as:

$$\sum_{il=1}^{il_{max}} P_{il}\, N_{il}^{site} < \sum_{dl=1}^{dl_{max}} P_{dl}\, N_{dl}^{seg} \qquad (11)$$

This would cause a reduction of average RBS size, with the speed of reduction dependent upon the specific combination of indel sizes and probabilities. However, we can be quite sure that, owing to the fundamental asymmetry of genomic indels, the reduction decelerates with time for any given set of deletion bias parameters. Balanced indels may be defined as:

$$\sum_{il=1}^{il_{max}} P_{il}\, N_{il}^{site} \approx \sum_{dl=1}^{dl_{max}} P_{dl}\, N_{dl}^{seg} \qquad (12)$$
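As a small illustration of Eqs. (10)–(12), the two sums can be compared directly for a segment of a given length. The Python sketch below is hypothetical throughout: the text does not specify how the site and segment counts are obtained, so the example assumes that a segment of length L offers L + 1 insertion sites for every insertion length and L − dl + 1 deletable windows of length dl. The function names and the 1% tolerance used for "approximately equal" are likewise assumptions of this example.

```python
import math


def indel_sums(length, p_ins_by_len, p_del_by_len):
    """Return the left- and right-hand sums compared in Eqs. (10)-(12).

    p_ins_by_len and p_del_by_len map an indel length to its probability.
    """
    # Assumption: every insertion length can use any of the length + 1 sites.
    ins = sum(p * (length + 1) for p in p_ins_by_len.values())
    # Assumption: a deletion of length dl can start at length - dl + 1 positions.
    dele = sum(p * max(length - dl + 1, 0) for dl, p in p_del_by_len.items())
    return ins, dele


def classify_bias(length, p_ins_by_len, p_del_by_len, rel_tol=0.01):
    """Label the regime as 'insertion', 'deletion' or 'balanced'."""
    ins, dele = indel_sums(length, p_ins_by_len, p_del_by_len)
    if math.isclose(ins, dele, rel_tol=rel_tol):
        return "balanced"
    return "insertion" if ins > dele else "deletion"
```

Under these assumed counts, `classify_bias(1000, {1: 1e-6, 10: 1e-7}, {1: 1e-6})` compares two insertion classes against one deletion class for a 1000-nucleotide segment; changing the dictionaries shows how the effective bias depends on the whole spectrum of indel sizes rather than on any single rate.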

Eqs. (10)–(12) assume no correlations among the probabilities of insertions/deletions of variously sized segments. This might be reasonable for a large base RBS size, considering the overall small probabilities of all insertions/deletions of various sizes. However, calculating the RBS length distribution with accurate full expansion is extremely difficult even for single-nucleotide indels, as demonstrated above; quantitatively and exhaustively solving for possible combinations of different indel sizes and probabilities is even more computationally prohibitive. As DPS in a given species is also evolving, these quantities are

Fig. 2. Evolution of Psum as a function of iteration (generation) for various combinations of insertion probability, indel probability difference, and truncation order (Pi, ΔP = Pi − Pd, k). a) With given k = 20, Red: (Pi = 0.0101, ΔP = 0.0001); Green: (Pi = 0.0105, ΔP = 0.0005); Blue: (Pi = 0.011, ΔP = 0.001); Purple: (Pi = 0.015, ΔP = 0.005); b) With given Pi = 0.011 and ΔP = 0.001, Red: (k = 10); Green: (k = 20); Blue: (k = 50). It is apparent that for a given order, larger Pi and ΔP cause a faster decrease of Psum. For given Pi and ΔP, a smaller truncation order (k) results in a faster decrease of Psum. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)


Fig. 3. Distributions of RBS length after various numbers of generations of replication, with the corresponding averages indicated by filled circles on the purple line in Fig. 1. The relevant parameters are: Pi = 10^−3, ΔP = Pi − Pd = 10^−4 and k = 150. a) n = 30000; b) n = 40000; c) n = 50000; d) n = 60000; e) and f) are magnifications of the leftmost part of c) and d) respectively. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

not only species specific but may also vary over generations of a given species. Therefore, obtaining accurate indel parameters is an extremely difficult, if at all possible, task. While we are not able to tackle these issues quantitatively at present, it is qualitatively very clear that the asymmetry effect is more dramatic for insertion bias of larger segments, as both the number of added expansion sites and the number of removed deletable nucleotides are proportional to segment size.

3. Discussion

The “C-value enigma” is more relevant for eukaryotes, whose genome sizes range over four orders of magnitude, while known bacterial genome sizes span approximately one order of magnitude (Casjens, 1998; Mira et al., 2001). The consistently observed deletion bias (Mira et al., 2001) in bacterial genomes was explained as a balance between natural selection for smaller genomes on the one hand, and the maintenance of essential functions and gene duplication/acquisition events on the other (Mira et al., 2001). Non-coding regions (spacers) in bacterial genomes are relatively small and are not sufficiently large to prime significant expansion of RBS on the one hand; on the other hand, their size distributions are more or less flat within a narrow range. These characteristics suggest that RBS, which have exponentially decaying length distributions, are probably not significant contributors to the size evolution of bacterial genomes. The observed deletion bias does not necessarily imply that insertion bias never occurred in bacteria. However, if there were bacteria that had insertion bias lasting long enough to experience explosive size growth, they would have been purged by the strict size control due to energy (Lane and Martin, 2010) (or alternative unknown mechanisms).

When considering indels of variously sized segments within the same generation, as in Eqs. (10)–(12), we argued that the non-interacting assumption among various indels is reasonable. When fixed insertions/deletions at different generations are considered, while the probability of direct interaction between any fixed insertion in history and a specific later indel event remains small, the accumulated interaction probability between a fixed genome insertion and any later indel event becomes more significant with an increasing number of generations. In addition to providing more sites for further insertions, large segments inserted via various mechanisms may provide effective priming for small indels later on.

The most important feature of genetic insertions and deletions is the fundamental asymmetry as revealed by our simple model.


Specifically, i) all insertion events increase the number of possible sites for further insertion, while all deletion events decrease the number of deletable nucleotides in the next round of replication; ii) deletion of RBS is bounded by zero size (when there is no RBS to delete), while no strict upper bound (except for the unknown accommodation limit of a harboring cell) exists for insertion, as the largest allowable RBS may be quite massive judging from observed genomes and may well be species dependent. Apart from small indels, TEs and other forms of insertion of relatively large segments (e.g. symbiosis, virus invasion, whole and partial genome duplication, cross-over mistakes) are important contributors to genome size evolution (Petrov, 2001; Sessegolo et al., 2016; Lower et al., 2017; Dubin et al., 2018; Rodriguez and Arkhipova, 2018). Each of these mechanisms has its own characteristics, with varying importance for different species. Regardless of their physical origin, all forms of genomic insertions and deletions share the fundamental asymmetry revealed by our simple model of single-nucleotide indels. It is not clear at this stage how and to what extent the various genetic insertion mechanisms correlate with small indel biases (insertion bias, deletion bias or the lack of either). However, it is beyond doubt that large-segment insertions of all forms are not essential upon fixation; they may therefore serve as priming bases for explosive expansion under insertion bias, be gradually removed, partially or fully, under deletion bias, or grow slowly when indels are balanced. Deletion bias has been observed in analyses of many living organisms and deemed a possible explanation of genome size differences among closely related species (de Jong and Ryden, 1981; Graur et al., 1989; Petrov et al., 1996; Ophir and Graur, 1997; Kapusta et al., 2017).
However, insertion bias has not been considered as an explanation for the source of large genomes in the “C-value enigma”, for the following reasons. Firstly, it was intuitively deemed to cause symmetrical, small and insignificant increases of genome size. Secondly, no persistent insertion bias has ever been revealed. Nonetheless, the non-essential deletable segments need to be generated in the first place, and insertion bias is one candidate for accomplishing this. It is clear from our model that, when there are sufficiently large RBS segments for priming, sustained effective insertion bias over a limited number of generations may generate very large RBS segments, and such insertion bias is not necessarily associated with the complexity or identity of specific species. Insertion bias thus provides a possible complementary explanation for apparent genome size differences across very similar species for which selfish DNA is not decisively important or is completely irrelevant. Similarly, insertion bias may occur once or repeatedly in some very simple organism and result in extremely large genomes. Indels therefore help in explaining the “C-value enigma”.

We do not know the atomically detailed mechanism(s) of indel balance control for any specific species. The complexity of molecular interactions (Melkikh and Khrennikov, 2017; Melkikh and Meijer, 2018) and the accompanying constraints imposed are tremendous. It is important to note that genetic evolution (insertions, deletions and mutations) has neither plan nor goal, and therefore randomness is a good approximation. Nonetheless, the subsequent selection of such molecular events, which is certainly not random, is subject to all available constraints of molecular interactions microscopically and to the robustness of the host organism macroscopically. Such constraints, of course, are constantly changing as organisms evolve and the environment changes.
Our simple model apparently lacks the power to explain how such molecular complexities shape indels. The point of this simple model is that, regardless of the molecular mechanisms involved, the consequent size evolution under unbalanced indels abides by the fundamental asymmetry. Such asymmetry dictates that any surviving species must have been in a balanced or deletion bias state for the overwhelming majority of its evolutionary history. For surviving large genomes that have significant contributions from insertion bias, the insertion bias state had to be turned on and then turned off shortly after (within a limited number of generations), and multiple such occurrences are possible. Since this cannot be planned by any species, our speculation is that for a surviving species with a large genome engendered by insertion bias, the insertion bias might have been triggered by mutation(s) in the relevant molecular machine(s) at some stage of evolution and was either suppressed or reversed later on. Therefore, the insertion bias state was transient and consequently extremely hard to capture afterwards. Species that had insertion bias turned on but not turned off would have become extinct simply because of the eventually excessive size of their genomes. These corollaries of the fundamental asymmetry of indels are, on the one hand, consistent with, and on the other hand provide a workable explanation for, the consistently observed deletion biases in various species. Therefore, it is difficult for anyone to succeed in finding the mechanism of insertion bias by looking carefully into existing large genomes, as whatever the cause was, it has most likely been removed. It is important to note that a segment of nucleotides inserted initially as RBS might evolve into a functional, or even essential, component of the host genome, rendering identification of RBS segments a challenge. While the causes of possible RBS bursts in present large genomes are likely no longer visible, some of their consequences (segments of RBS) that have not been eroded away by deletions should remain. Comparison of similar species with dramatic contrasts in genome size might be a good starting point for digging out such remaining RBS.
Collectively, the remains of RBS (subject only to structural restraints in terms of chromosome packing) should have larger entropy than other genome segments, which are subject to functional and/or regulatory constraints in addition to structural ones. However, it is not clear how to define an appropriate entropy measure for this purpose. One possible route to finding the exact molecular causes of large genomes would be to develop hypotheses about specific molecular machineries and test them with synthetic biology methodologies.
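As a purely illustrative starting point for the entropy question raised above, the sketch below computes the Shannon entropy of the overlapping k-mer distribution of a sequence. Treating k-mer Shannon entropy as the measure is an assumption made only for illustration, not a measure proposed or validated in this work, and the function name `kmer_entropy` and the default `k` are hypothetical.

```python
from collections import Counter
from math import log2


def kmer_entropy(seq, k=2):
    """Shannon entropy (in bits) of the overlapping k-mer distribution of seq.

    Assumption: k-mer frequencies are used as a naive stand-in for the
    (undefined) entropy measure discussed in the text.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kmers:
        raise ValueError("sequence is shorter than k")
    counts = Counter(kmers)
    total = len(kmers)
    # H = -sum_i p_i * log2(p_i) over the observed k-mers
    return -sum((c / total) * log2(c / total) for c in counts.values())


# A homopolymer carries no k-mer diversity; a sequence that uses all four
# nucleotides equally reaches the single-nucleotide maximum of 2 bits.
low = kmer_entropy("AAAAAAAA", k=1)
high = kmer_entropy("ACGTACGT", k=1)
```

Such a measure only captures local compositional diversity. Whether it could separate RBS remnants from constrained segments is exactly the open question stated above.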

4. Conclusions

Starting with a simple mathematical model of a thought perfect genome, we demonstrated that even single-nucleotide insertion bias alone may result in arbitrarily large RBS. The explosive growth of RBS size at later stages of insertion bias was attributed to the fundamental indel asymmetry, which may be generalized to any combination of indel sizes and probabilities. Consistent with our model, persistent deletion bias has been observed in many indel analyses. Tails of the RBS length distributions predicted by this model qualitatively agree with the length distribution of observed genomes, suggesting that significant and varying fractions of large genomes were possibly RBS at the time of their integration into the genome. Insertion bias thus provides an explanation of the long-standing “C-value enigma” that complements the well-accepted selfish DNA. The exponential growth of RBS caused by the fundamental indel asymmetry implies that effective insertion bias cannot last for long periods on geological time scales, and consequently the accompanying explosive expansions of RBS (and of the genomes harboring them) may only occur sporadically. Based on this model, we predict that the molecular mechanisms that generated any large genome through effective insertion bias are highly likely to have been removed. Therefore, the most viable way of investigating the molecular mechanism of insertion bias is to develop hypotheses and validate them in synthetic biological systems, rather than to search surviving large genomes for possible clues. Our model also provides a theoretical explanation for the observed proportional model of evolution.
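The fundamental indel asymmetry summarized above can be illustrated with a minimal sketch. It assumes, for illustration only and not necessarily as the exact model used in this work, that a segment of n nucleotides offers n + 1 insertion sites but only n deletable nucleotides, and that each site independently gains or loses one nucleotide per generation with probabilities `p_ins` and `p_del`. Under that assumption the expected size obeys E[n(t+1)] = E[n(t)] + (E[n(t)] + 1) p_ins - E[n(t)] p_del. The function name and all parameter values are hypothetical.

```python
def expected_sizes(n0, p_ins, p_del, generations):
    """Expected segment size per generation under the assumed per-site model."""
    sizes = [float(n0)]
    for _ in range(generations):
        n = sizes[-1]
        # Asymmetry: n + 1 insertion sites, but only n deletable nucleotides.
        sizes.append(n + (n + 1) * p_ins - n * p_del)
    return sizes


# Hypothetical parameters: 100-nucleotide segment, 1000 generations.
balanced = expected_sizes(100, 1.0e-3, 1.0e-3, 1000)  # p_ins == p_del
ins_bias = expected_sizes(100, 1.1e-3, 1.0e-3, 1000)  # slight insertion bias
del_bias = expected_sizes(100, 1.0e-3, 1.1e-3, 1000)  # slight deletion bias

for label, sizes in [("balanced", balanced),
                     ("insertion bias", ins_bias),
                     ("deletion bias", del_bias)]:
    print(f"{label}: {sizes[0]:.2f} -> {sizes[-1]:.2f}")
```

Under these assumptions, balanced probabilities still add `p_ins` nucleotides per generation on average (slow linear growth), a slight insertion bias compounds geometrically, and a slight deletion bias makes the expected size of this segment decrease.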


Funding

This work is supported by the National Key Research and Development Program of China (2017YFB0702500), and by National Natural Science Foundation of China (grant 31270758 to P.T. and grant 31401123 to S.T.).

Availability of data and materials

The datasets and codes used and/or analysed during the current study are available from the corresponding author on reasonable request.

Author’s contributions

P.T. conceived the study and coded the computation. S.T. and P.T. developed the mathematical model, Y.H. and P.T. performed analysis. S.T. and P.T. wrote the manuscript.

Ethics approval and consent to participate

Not applicable.

Declaration of Competing Interest

The authors declare that they have no competing interests.

References

Ågren, J.A., Greiner, S., Johnson, M.T.J., Wright, S.I., 2015. No evidence that sex and transposable elements drive genome size variation in evening primroses. Evolution 69 (4), 1053–1062. doi:10.1111/evo.12627.
Bennett, M.D., Leitch, I.J., 2005. Chapter 2 - genome size evolution in plants. In: Gregory, T.R. (Ed.), The Evolution of the Genome. Academic Press, pp. 89–162. ISBN 978-0-12-301463-4, doi:10.1016/B978-012301463-4/50004-8.
Casjens, S., 1998. The diverse and dynamic structure of bacterial genomes. Annu. Rev. Genet. 32 (1), 339–377. doi:10.1146/annurev.genet.32.1.339.
Doolittle, F., Brunet, T.D.P., Linquist, S., Gregory, T.R., 2014. Distinguishing between “function” and “effect” in genome biology. Genome Biol. Evolut. 6 (5), 1234–1237. doi:10.1093/gbe/evu098.
Doolittle, W., 2013. Is junk DNA bunk? a critique of ENCODE. Proc. Natl. Acad. Sci. USA doi:10.1073/pnas.1221376110.
Doolittle, W.F., Sapienza, C., 1980. Selfish genes, the phenotype paradigm and genome evolution. Nature 284, 601–603.
Dubin, M.J., Mittelsten Scheid, O., Becker, C., 2018. Transposons: a blessing curse. Current Opin. Plant Biol. 42, 23–29. doi:10.1016/j.pbi.2018.01.003.
Dunham, I., Kundaje, A., Aldred, S.F., Collins, P.J., Davis, C.A., Doyle, F., et al., 2012. An integrated encyclopedia of DNA elements in the human genome. Nature 489 (7414), 57–74. doi:10.1038/nature11247.
Eddy, S.R., 2012. The c-value paradox, junk DNA and ENCODE. Current Biol. 22 (21), R898–R899. doi:10.1016/j.cub.2012.10.002.
Eddy, S.R., 2013. The ENCODE project: missteps overshadowing a success. Current Biol. doi:10.1016/j.cub.2013.03.023.
Elizabeth, P., 2012. Genomics encode project writes eulogy for junk DNA. Science 337 (September), 1159–1161. doi:10.1126/science.337.6099.1159.
Freeling, M., Xu, J., Woodhouse, M., Lisch, D., 2015. A solution to the c-value paradox and the function of junk DNA: the genome balance hypothesis. Mol. Plant 8 (6), 899–910. doi:10.1016/j.molp.2015.02.009.
Germain, P.L., Ratti, E., Boem, F., 2014. Junk or functional DNA? ENCODE and the function controversy. Biol. Philos. 29 (6), 807–831. doi:10.1007/s10539-014-9441-3.
Graur, D., Shuali, Y., Li, W.H., 1989. Deletions in processed pseudogenes accumulate faster in rodents than in humans. J. Mol. Evolut. 28 (4), 279–285. doi:10.1007/BF02103423.
Graur, D., Zheng, Y., Price, N., Azevedo, R.B.R., Zufall, R.A., Elhaik, E., 2013. On the immortality of television sets: “function” in the human genome according to the evolution-free gospel of encode. Genome Biol. Evolut. 5 (3), 578–590. doi:10.1093/gbe/evt028.


Gregory, T.R., 2003. Is small indel bias a determinant of genome size. Trend Genet. 19, 485–488. doi:10.1016/S0168-9525(03)00192-6.
Gregory, T.R., 2004. Insertion-deletion biases and the evolution of genome size. Gene 324 (1–2), 15–34. doi:10.1016/j.gene.2003.09.030.
Gregory, T.R., 2005. Chapter 1 - genome size evolution in animals. In: Gregory, T.R. (Ed.), The Evolution of the Genome. Academic Press, pp. 3–87. ISBN 978-0-12-301463-4, doi:10.1016/B978-012301463-4/50003-6.
de Jong, W.W., Ryden, L., 1981. Causes of more frequent deletions than insertions in mutations and protein evolution. Nature 290 (12), 157–159.
Kapusta, A., Suh, A., Feschotte, C., 2017. Dynamics of genome size evolution in birds and mammals. Proc. Natl. Acad. Sci. USA 114 (8), E1460–E1469. doi:10.1073/pnas.1616702114.
Kellis, M., Wold, B., Snyder, M.P., Bernstein, B.E., Kundaje, A., Marinov, G.K., et al., 2014. Defining functional DNA elements in the human genome. Proc. Natl. Acad. Sci. USA 111 (17), 6131–6138. doi:10.1073/pnas.1318948111.
Lane, N., Martin, W., 2010. The energetics of genome complexity. Nature 467 (7318), 928–934. doi:10.1038/nature09486.
Lower, S.S., Johnston, J.S., Stanger-Hall, K.F., Hjelmen, C.E., Hanrahan, S.J., Korunes, K., et al., 2017. Genome size in north american fireflies: substantial variation likely driven by neutral processes. Genome Biol. Evolut. 9 (6), 1499–1512. doi:10.1093/gbe/evx097.
Melkikh, A.V., Khrennikov, A., 2017. Molecular recognition of the environment and mechanisms of the origin of species in quantum-like modeling of evolution. Prog. Biophys. Mol. Biol. 130, 61–79. doi:10.1016/j.pbiomolbio.2017.04.008.
Melkikh, A.V., Meijer, D.K.F., 2018. On a generalized levinthal’s paradox: the role of long- and short range interactions in complex bio-molecular reactions, including protein and DNA folding. Prog. Biophys. Mol. Biol. 132, 57–79. doi:10.1016/j.pbiomolbio.2017.09.018.
Mira, A., Ochman, H., Moran, N.A., 2001. Deletional bias and the evolution of bacterial genomes. Trends Genet. 17 (10), 589–596. doi:10.1016/S0168-9525(01)02447-7.
Niu, D.K., Jiang, L., 2013. Can ENCODE tell us how much junk DNA we carry in our genome? Biochem. Biophys. Res. Commun. 430 (4), 1340–1343. doi:10.1016/j.bbrc.2012.12.074.
Nóbrega, M.A., Zhu, Y., Plajzer-Frick, I., Afzal, V., Rubin, E.M., 2004. Megabase deletions of gene deserts result in viable mice. Nature 431, 988. doi:10.1038/nature03022.
Ohno, S., 1972. So much “junk” dna in our genome. Brookhaven Symp. Biol. 23, 366–370.
Oliver, M., Petrov, D., Ackerly, D., Falkowski, P., Schofield, O., 2007. The mode and tempo of genome size evolution in eukaryotes. Genome Res. 19, 594–601. doi:10.1101/gr.6096207.
Ophir, R., Graur, D., 1997. Patterns and rates of indel evolution in processed pseudogenes from humans and murids. Gene 205 (1–2), 191–202. Junk DNA: The Role and the Evolution of Non-Coding Sequences. doi:10.1016/S0378-1119(97)00398-3.
Orgel, L.C.F., 1980. Selfish DNA: the ultimate parasite. Nature 284 (17), 604–607.
Palazzo, A.F., Gregory, T.R., 2014. The case for junk DNA. PLoS Genet. 10 (5). doi:10.1371/journal.pgen.1004351.
Palazzo, A.F., Lee, E.S., 2015. Non-coding RNA: what is functional and what is junk? Frontiers in Genetics 5 (JAN), 1–11. doi:10.3389/fgene.2015.00002.
Petrov, D., 2001. Evolution of genome size: new approaches to an old problem. Trend. Genet. doi:10.1016/S0168-9525(00)02157-0.
Petrov, D., 2002. Mutational equilibrium model of genome size evolution. Theor. Populat. Biol. 61 (4), 531–544. doi:10.1006/tpbi.2002.1605.
Petrov, D., Lozovskaya, E.R., Hartl, D.L., 1996. High intrinsic rate of DNA loss in drosophila. Nature 384 (6607), 346–349. doi:10.1038/384346a0.
Robicheau, B.M., Susko, E., Harrigan, A.M., Snyder, M., 2017. Ribosomal RNA genes contribute to the formation of pseudogenes and junk DNA in the human genome. Genome Biol. Evolut. 9 (2), 1–43. doi:10.1093/gbe/evw307.
Rodriguez, F., Arkhipova, I.R., 2018. Transposable elements and polyploid evolution in animals. Current Opin. Genet. Dev. 49, 115–123. doi:10.1016/j.gde.2018.04.003.
Senra, M.V., Sung, W., Ackerman, M., Miller, S.F., Lynch, M., Soares, C.A.G., 2018. An unbiased genome-wide view of the mutation rate and spectrum of the endosymbiotic bacterium teredinibacter turnerae. Genome Biol. Evolut. 10 (3), 723–730. doi:10.1093/gbe/evy027.
Sessegolo, C., Burlet, N., Haudry, A., 2016. Strong phylogenetic inertia on genome size and transposable element content among 26 species of flies. Biol. Lett. 12 (8), 2016–2019. doi:10.1098/rsbl.2016.0407.