Gene 276 (2001) 57–72 www.elsevier.com/locate/gene
Delineating relative homogeneous G 1 C domains in DNA sequences Wentian Li* Laboratory of Statistical Genetics, The Rockefeller University, 1230 York Avenue, Box 192, New York, NY 10021, USA Received 26 March 2001; received in revised form 20 June 2001; accepted 10 August 2001 Received by G. Bernardi
Abstract The concept of homogeneity of G 1 C content is always relative and subjective. This point is emphasized and quantified in this paper using a simple example of one sequence segmented into two subsequences. Whether the sequence is homogeneous or not can be answered by whether the two-subsequence model describes the DNA sequence better than the one-sequence model. There are at least three equivalent ways of looking at the 1-to-2 segmentation: Jensen–Shannon divergence measure, log likelihood ratio test, and model selection using Bayesian information criterion. Once a criterion is chosen, a DNA sequence can be recursively segmented into multiple domains. We use one subjective criterion called segmentation strength based on the Bayesian information criterion. Whether or not a sequence is homogeneous and how many domains it has depend on this criterion. We compare six different genome sequences (yeast S. cerevisiae chromosome III and IV, bacterium M. pneumoniae, human major histocompatibility complex sequence, longest contigs in human chromosome 21 and 22) by recursive segmentations at different strength criteria. Results by recursive segmentation confirm that yeast chromosome IV is more homogeneous than yeast chromosome III, human chromosome 21 is more homogeneous than human chromosome 22, and bacterial genomes may not be homogeneous due to short segments with distinct base compositions. The recursive segmentation also provides a quantitative criterion for identifying isochores in human sequences. Some features of our recursive segmentation, such as the possibility of delineating domain borders accurately, are superior to those of the moving-window approach commonly used in such analyses. q 2001 Elsevier Science B.V. All rights reserved. Keywords: Compositional homogeneity; Domains-within-domain phenomenon; Isochore; Isochore borders; Segmentation
1. Introduction The completion of the draft sequence of the human genome (Venter et al., 2001; Lander et al., 2001) as well as genome sequences of other organisms makes it possible to address many primary-sequence-based questions in a clear, precise form. For example, are sequences in a genome compositionally homogeneous? Are isochores homogeneous? Are different chromosome sequences in a genome similar? Are different genome sequences similar, etc.? The central issue in these questions is a definition of ‘homogeneity’. Besides the common knowledge that randomly generated sequences are homogeneous, it is not widely appreciated that the concept of homogeneity is relative, and strictly speaking, subjective. A sequence that deviates somewhat from the random sequence can be claimed to be inhomogeAbbreviations: BIC, Bayesian information criterion; bp, base pair; kb, kilo (1000) base pairs; LLR, log likelihood ratio; Mb, mega (1,000,000) base pairs; MHC, major histocompatibility complex * Tel.: 11-212-327-7977; fax: 11-212-327-7996. E-mail address:
[email protected] (W. Li).
neous (heterogeneous) by one standard, and homogeneous by another. Of course, it is possible that a sequence can be so clearly inhomogeneous that it would never be considered to be homogeneous by any standard. Recent work has shown, however, that many eukaryotic DNA sequences can be in a ‘gray area’: not only could there be uncertainties as to whether a given sequence is homogeneous or not, but also such uncertainty can be traced to ever smaller length scales. In other words, a seemingly homogeneous sequence under one standard can be inhomogeneous under a slightly relaxed standard. After one attempts to split the sequence into homogeneous segments, each segment can experience the same ambivalence, and such a cycle could continue to smaller and smaller domains. Several phrases have been used to describe this phenomenon: ‘domains within domains’ (Li et al., 1994; BernaolaGalva´n et al., 1996; Li, 1997a,b), ‘fluctuation and variation at different length scales’, ‘complex compositional heterogeneity’ (Roma´n-Rolda´n et al., 1998; Li, 1997a,b), etc. The concept of ‘domains within domains’ also leads to many other terms used to describe DNA sequences: hierarchical patterns, fractal landscapes (Peng et al., 1992), non-
0378-1119/01/$ - see front matter q 2001 Elsevier Science B.V. All rights reserved. PII: S 0378-111 9(01)00672-2
58
W. Li / Gene 276 (2001) 57–72
trivial long-range correlations (Li, 1992; Li and Kaneko, 1992), one-over-f (1/f ) spectra (Li and Kaneko, 1992; Voss, 1992), etc. The very existence of this type of pattern in DNA sequences makes it difficult for the traditional fixed-length moving-window approach to capture the compositional complexity. It is no wonder that debates remain to this day on the topic of compositional homogeneity (Nekrutenko and Li, 2000; Ha¨ ring and Kypr, 2001; Lander et al., 2001). In this paper, it will be shown that there are appropriate tools for segmenting a DNA sequence into the right number of domains, large or small. Complexity of a spatial compositional fluctuation can be revealed by gradually changing the definition of sequence homogeneity. The paper is organized as follows: the Methods section (Section 2) includes descriptions of simple 1-to-2 segmentation, recursive segmentation, and stopping criteria for recursions; the Results section (Section 3) includes segmentation results of six sequences (two yeast chromosomes, one bacterium genome sequence, and three long human sequences); the Discussion section (Section 4) contains four topics: comparison of window-less and window-based analysis, the relative nature of a homogeneity test, the debate on the existence of isochores, and some caveats of the segmentation procedure.
2. Methods 2.1. One-step segmentation in the information theoretic framework Concerning homogeneity of a DNA sequence, the simplest situation is whether a sequence is better described by two, left and right, random subsequences with different base compositions, or by the description of one random sequence with one set of base compositions. This situation is the simple one-step segmentation, also called a 1-to-2 segmentation (Bernaola-Galva´ n et al., 1996; Oliver et al., 1999). There are at least three equivalent ways (but from different perspectives) to examine this question, and one can easily see that a subjective decision must be made. From the first perspective, we would like to find a segmentation point (cutting point, partition point, etc.) that maximizes the base compositions between the left and right subsequences. Once such a point is found, we then ask whether the base composition difference between the left and right subsequence is ‘large enough’. In BernaolaGalva´ n et al. (1996) and Oliver et al. (1999), it is proposed that an information-theoretic quantity called Jensen–Shannon divergence DJS (Lin, 1991) be used as a measure of the difference between two base compositions (at a position 0 , i , N): i N2i Hright ð1Þ DJS ðiÞ ¼ H 2 Hleft 2 N N P where H is the entropy 2 a¼ðA;C;G;TÞ pa logðpa Þ of the
whole sequence ({pa }, a ¼ ðA; C; G; TÞ are base compositions); Hleft and Hright are the entropies of the left and right subsequence. It can be shown that DJS has many desirable properties as a distance measure (Lin, 1991; BernaolaGalva´ n et al., 2001). The estimated value of DJS is denoted by D^ JS , where pa s are estimated by p^a ¼ Na =N. Once D^ JS ðiÞ is calculated over all possible segmentation points, the maximum difference between the two base compositions in the left and right subsequences is D^ JS ø max D^ JS ðiÞ i
ð2Þ
A sequence can be declared homogeneous if D^ JS is below a threshold value. Such a threshold value may be obtained, for example, empirically by simulating random sequences. 2.2. One-step segmentation in the hypothesis testing framework The second perspective may shed light on a theoretical value for the threshold value. We may consider the question of homogeneity as a hypothesis test. The hypothesis to be tested (usually to be rejected) is the ‘null’ hypothesis that the sequence is a random sequence. A common framework for a hypothesis test is the (maximum log) likelihood ratio test (e.g. Cox and Hinkley, 1974; Lehmann, 1991). In the likelihood ratio test, the likelihood of the null model (L1) and that of the alternative model (L2) are calculated, and maximized by adjusting free parameters in the likelihood function (resulting in L^ 1 , L^ 2 ). If the null model is true, the 2log with base e) of the ratio of the two obeys the x 2 distribution with K2 2 K1 degrees of freedom, where K2 and K1 are the number of parameters in the alternative and the null model (e.g. Cox and Hinkley, 1974; Lehmann, 1991): 2log
L^ 2 , x2ðdf¼K2 2K1 Þ L^ 1
ð3Þ
Several assumptions are required for this result: the sample size is large (asymptotic), the null hypothesis appears as a special case of the alternative hypothesis, the maximum likelihood is achieved in the interior of the parameter space, and free parameters in both null and alternative models are continuous variables. Q segmentation, L1 ¼ a¼ðA;C;G;TÞ pNa a , L2 ¼ Q For our 1-to-2 Na;l Na;r and the (log) a¼ðA;C;G;TÞ pa;l pa;r (l and r for left and right), P maximized likelihoods are log(Lˆ1) ¼ aNalog(Na/ ˆ , log(Lˆ2) ¼ 2iH ˆ left 2 (N 2 i)H ˆ right. The (2log) N) ¼ 2NH likelihood ratio can have two forms: if the partition point i is not considered as a free parameter, then LLRðiÞ ¼ 2log
L^ 2 ðiÞ ¼ 2N D^ JS ðiÞ L^ 1
ð4Þ
If we also maximize the second likelihood over the partition point i, then LLR ¼ 2log
maxi L^ 2 ðiÞ ¼ 2N D^ JS L^ 1
ð5Þ
W. Li / Gene 276 (2001) 57–72
By Eq. (3), the ‘null distribution’ (i.e. distribution of the test statistic if the null hypothesis is true) of LLR(i) is x2ðdf¼3Þ (the three degrees of freedom are due to the fact that two random subsequences need six base composition parameters, whereas one random sequence needs three base composition parameters). At a first glance, the null distribution of LLR is x2ðdf¼4Þ because now we have one more free parameter: the partition point i. Nevertheless, this parameter is not continuous and the condition for Eq. (3) is not satisfied. We do not know an analytic expression of the null distribution of LLR (‘appears completely intractable’; Pettitt, 1980), but this distribution should depend on sequence length N. To see this, assume the LLR(i) series is approximated by a random walk in position i. Since the variance of a random walk increases with time (or here, sequence length N), the standard deviation and the maximum value maxiLLR(i) increase with the square root of N. Repeating such a walk many times, one can obtain the distribution for maxiLLR(i). It is clear that such a null distribution is a function of N. This conclusion was known in a branch of the field of statistics called ‘change-point problem’ (e.g. James et al., 1987), and was also observed in computer simulations (Bernaola-Galva´ n et al., 2001). If the null distribution of LLR is known exactly, there is a one-to-one correspondence between a value of LLR and the tail area under the null distribution beyond this value (also called the p value). It is common practice in hypothesis testing to set a threshold for the p value (also called significance level) and the corresponding threshold for LLR can be derived from the null distribution. 2.3. One-step segmentation in the model selection framework The hypothesis testing framework, which originated from Neyman and Pearson’s work in the 1930s, has attracted criticism for years (e.g. Berger and Berry, 1988; Royall, 1997). One of the alternative paradigms is the model selection framework (e.g. Burnham and Anderson, 1998), which is our third perspective on the 1-to-2 segmentation. In a model selection framework, one model is better than another not only because it fits the data well (higher likelihood), but also because it does so while keeping the model simple. A model selection criterion is based on a balance between how well the model fits the data and the model complexity. One quantity of particular interest to us is the Bayesian information criterion (BIC), defined as ^ 1 logðNÞK BIC ¼ 22logðLÞ
ð6Þ
where L^ is the maximized likelihood, N is the sequence length (sample size), and K is the number of parameters in the model. The model with the smaller BIC value is preferred and should be selected. In our 1-to-2 segmentation, the first model is simply the random sequence model (M1), and the second model is where the sequence consists of two random subsequences
59
with different base compositions (M2). Repeating the same calculation of maximum likelihoods, it can be shown that BICs of the two models are: BIC1 ¼ 2N H^ 1 3logðNÞ and BIC2 ¼ 2½iH^ left 1 ðN 2 iÞH^ right 1 7logðNÞ, where the partition point is considered as a parameter in the second model (see Section 4.4 for more discussion). If we require BIC2 , BIC1 , we have the following criterion for continuing the recursive segmentation (Li, 2001a,b): 2N D^ JS . 4logðNÞ
for four 2 symbol sequences
2N D^ JS . 2logðNÞ
for two 2 symbol sequences
ð7Þ
In all three perspectives, we require that D^ JS or 2N D^ JS exceed a given threshold value before we claim the sequence to be inhomogeneous. The subjective nature of this decision making may cause confusion in discussing homogeneity of DNA sequences if the heterogeneity in the sequence is subtle and complex. Only when a threshold is chosen to define what is considered homogeneous can we claim that the sequence is not homogeneous. 2.4. Many-step segmentation and domain borders If the question of whether a DNA sequence can be decomposed into two relative homogeneous subsequences is a decision-making problem, the same is true for the problem of segmenting the sequence into many relative homogeneous subsequences. Attempting all possible numbers of subsequences and their borders is computationally not feasible, and in computer science, it is typical to use a local-optimal solution instead of the brute-force search for a global solution. One such local-optimal solution is the divide-andconquer algorithm where the original problem is broken into two or more sub-problems, each similar to the original problem but smaller in size (e.g. Cormen et al., 1990). For our problem of segmenting a DNA sequence into many subsequences, one can consider the case of two subsequences first, then each subsequence is in turn segmented into two subsubsequences, etc. This recursive 1-to-2 segmentation (or many-step segmentation) of DNA sequences was first proposed in Bernaola-Galva´ n et al. (1996). Given a fixed decision criterion as to whether a sequence can be segmented into two subsequences, recursive segmentation will eventually lead to many subsequences, each relatively homogeneous as compared to its parental sequence. If the criterion is too relaxed, the resulting relative homogeneous domains are small (and can be as small as individual bases). On the other hand, if the criterion is too stringent, we may not be able to segment the sequence even once. Since segmentation points give us borders between neighboring domains, the overall domain structure depends on our chosen threshold value, and is thus subjective. In order to answer a question such as whether isochores exist or whether isochores are homogeneous (Nekrutenko and Li, 2000; Ha¨ ring and Kypr, 2001; Lander et al., 2001), a threshold has to be chosen beforehand.
60
W. Li / Gene 276 (2001) 57–72
Recursive segmentation and the resulting domain borders are superior to the commonly used moving-window approach. In a window-based approach, the window size (and sometimes, the moving distance) is predetermined. This seemingly minor technical detail actually may affect the conclusion: heterogeneity within a window will not be detected; the moving distance changes the degree of independence between the neighboring windows. Any windowbased test of compositional homogeneity has therefore to be repeated for many choices of window size, which may not be practical. Recursive segmentation has an advantage over the fixed-window-based method in that it can find the domain border more accurately. In recursive segmentation, there is no prior preference for the number of domains, and the determination of the domain border can be accurate to the base level. In window-based approaches, however, the same accuracy can only be reached when the moving distance is a single base. There are many applications of recursive segmentation to DNA sequence analysis. Applying segmentation directly to the original sequence will delineate domains with distinct base compositions. For regions with complicated base composition structures, such as telomere and centromere, recursive segmentation is effective in finding subtle patterns. For detecting G 1 C homogeneous (isochore) regions, the original sequence is converted to a W-S binary sequence (W ¼ {A; T}, S ¼ {C; G}). If the original sequence is converted to a binary sequence indicating the presence or absence of 5 0 -CG-3 0 dinucleotide, the recursive segmentation will detect CpG islands. Another potentially interesting application is to convert a DNA sequence to a 12-symbol sequence with information on both base and three codon/ phase positions (Bernaola-Galva´ n et al., 2000). Domain borders in this 12-symbol sequence are expected to be correlated to the coding–non-coding borders. In this paper, only detection of G 1 C domains will be discussed; other applications can be found in Li et al. (2001). We use a stopping criterion based on BIC because the null distribution for 2N D^ JS is not known analytically, and because the model selection framework is more flexible and can be generalized to other situations. BIC criterion has a desired property that it is ‘dimensionally consistent’ in (Burnham and Anderson, 1998), i.e. the decision to reject or accept the null hypothesis does not change with N. Other stopping criteria may not share this property with BIC. Sometimes, it is preferable to have an even more stringent stopping criterion than BIC. For this purpose, we propose a quantity called ‘BIC-based segmentation strength’ s, which measures the percentage increase of LLR from the BIC threshold: s;
2N D^ JS 2 2logðNÞ . s0 $ 0; 2logðNÞ
ð8Þ
for two 2 symbol sequences This segmentation strength has to be larger than a chosen
threshold s0 for the segmentation to be significant. If the threshold s0 is 0, we return to the BIC stopping criterion. If s0 is strictly positive, we have chosen a more stringent criterion than BIC. In the following, we will investigate which s0 values are appropriate for determining G 1 C domains in DNA sequences at certain length scales.
3. Results 3.1. Saccharomyces cerevisiae chromosome III The yeast chromosome III was the first complete eukaryotic chromosome sequenced (Oliver et al., 1992). It is well known that there exist two (G 1 C)-rich regions near the middle of the two arms (Sharp and Lloyd, 1993). Our recursive segmentation shows that there are 30 domains when s0 is set to 0, and 23, eight and seven domains when s0 is set at 0.3, 2, and 3, respectively. Fig. 1 shows the segmentation result with s0 ¼ 0:3, including the following information: (G 1 C)% in a moving window (window size, 20 kb; moving distance, 2 kb), domain borders of 23 domains (vertical dashed lines), (G 1 C)% of these domains (horizontal solid lines), and segmentation strength s (bars in the lower part of the plot). Note that the particular chosen window size and moving distance are for the purpose of illustrating two (G 1 C)-rich regions (another window size and moving distance may not show this feature; e.g. Li et al., 1994). There can be a difference between the (G 1 C)% in segmented domains (horizontal lines in Fig. 1) and those in a moving window (curve in Fig. 1) at a corresponding position, reflecting the fact that segmented domains can be much smaller than the window size. For example, the (G 1 C)% in two segmented telomere regions is close to 55%, but the (G 1 C)% in the 20 kb windows is never close to that value. There is another segmented 315-base domain (from position 57,011 to 57,325) whose (G 1 C)% is a low value of 18%. The moving-window approach at this window size is completely unable to identify these smaller domains with larger (G 1 C)% fluctuations. All 23 domains shown in Fig. 1 are real, in the sense that nearby domains do have large different (G 1 C)% under the given criterion for being different, e.g. the jump of (G 1 C)% from 36.6% in the long domain (98 kb) in the middle of the chromosome to 47.7% in a much shorter domain (8 kb) within the right-arm (G 1 C)-rich peak. If we are not interested in these smaller scale fluctuations (even though they are real), we could increase the stopping criterion s0. Table 1 shows the result from a more stringent stopping criterion (s0 ¼ 3) with only seven domains. The first domain is clearly related to the left telomere sequence. The fifth domain with 7986 bases is a local fluctuation within the (G 1 C)-rich domain in the right arm. If these two smaller domains are ignored, we obtain exactly the same five domains as Sharp and Lloyd (1993). The advantage of our approach is that we could determine the domain
W. Li / Gene 276 (2001) 57–72
61
Fig. 1. A recursive segmentation result on yeast chromosome III. Sequence length N ¼ 315; 341 bases; strength threshold s0 is set at 0.3; there are 23 segmented domains. A similar result using a more stringent stopping criterion (s0 ¼ 3) is shown in Table 1. The following information is included: (1) (G 1 C)% in a moving window (window length, 20 kb; moving distance, 2 kb); domain borders (vertical dashed lines); (G 1 C)% in segmented domains (horizontal solid lines); and segmentation strength s (vertical bars in the lower part of the plot). Some domains are marked.
borders to the base level (positions 45,862, 108,358, 206,887, and 270,930) (see Section 4.4 for more discussion). Such a task could be more difficult with a movingwindow approach. When the s0 threshold is increased from around 3 to 9, the five-domain pattern in Sharp and Lloyd (1993) becomes insignificant. Since the concept of homogeneity is always relative, we can say that the chromosome III sequence is homogeneous at around s0 ¼ 9. Note that since the first onestep segmentation has a strength of around 3, setting s0 ¼ 4 is enough to stop the recursive segmentation. Nevertheless, this first segmentation is more susceptible to chance events (see Section 4.4 for more discussion). If we start by removing the left telomere sequence first, only an s0 of around 9 can stop the segmentation at position 206,887. 3.2. Saccharomyces cerevisiae chromosome IV Although the 16 yeast chromosomes have many statistical features in common (Li et al., 1998), a similar organization into homogeneous G 1 C domains does not appear to be one of them. For example, it was noticed that two (G 1 C)-rich peaks in chromosome III do not exist in other yeast chromosomes (Dujon, 1996; Bradnam et al., 1999). One of the claims in Nekrutenko and Li (2000) is that yeast chromosomes are homogeneous, at least chromosome IV. In order to compare our result with that in Nekrutenko and Li (2000), we show here the recursive segmentation result for chromosome IV (Jacq et al., 1997). The number of segmented domains is 13, seven and three, respectively, with s0 set at 0, 0.3 and 1. These numbers are
much smaller than those in chromosome III, despite the fact that its chromosome size is almost five times longer. It indicates that it is much easier to claim that yeast chromosome IV is homogeneous than chromosome III. Fig. 2 shows the segmentation result for chromosome IV when s0 ¼ 0. Indeed, the (G 1 C)% difference between the neighboring segmented domains in Fig. 2 is less than 1.5%, whereas that in chromosome III can be as large as 4.1%. Of course, these differences should also be considered in the context of different domain sizes. Table 2 shows the segmentation result when s0 is set at 1, with only three domains. The last 7 kb domain is clearly related to the right telomere, the first domain (242 kb) is the same as that in Fig. 2, and the largest domain (1.28 Mb) is merged from the three domains in Fig. 2 (at the s0 ¼ 0 level). Note again the difference between the two domains in Table 2: 38.6 and 37.7%, a mere 1%. If one starts a recursive segmentation by first removing the right telomere sequence, setting s0 ¼ 1:5 is enough to make the chromosome IV homogeneous. This s0 ¼ 1:5 criterion is clearly much more relaxed than the s0 ¼ 9 value for chromosome III. We should caution that claiming that yeast chromosome IV is homogeneous does not imply that chromosome III is also homogeneous, at least not at the same segmentation strength level. In Li et al. (1998) we used the term ‘compositional heterogeneity’ to describe the non-randomness of the sequence, the presence of 1/f spectra, and slow-decaying variance with the window size. The term ‘long-range nonrandomness’ is probably a more appropriate term to describe this situation than the term ‘heterogeneity’. If a more commonly accepted meaning of heterogeneity is used, i.e.
62
W. Li / Gene 276 (2001) 57–72
larger variance between the group than within groups, then yeast chromosomes may not be ‘heterogeneous’. If we define ‘groups’ as segmented domains at a stringent condition, and ‘members of the group’ as segmented domains at a relaxed condition, the standard analysis of variance (ANOVA) test can be carried out to test homogeneity (Oliver and Li, 1998). This ‘two-level segmentation test’ showed that chromosomes III and VIII failed the test at the 0.005 significance level, chromosome V failed at the 0.05 level, chromosome XII failed at the 0.1 level, and the other chromosomes passed this homogeneity test (Oliver and Li, 1998). 3.3. Bacterium Mycoplasma pneumoniae Bacterium Mycoplasma pneumoniae is a human pathogen that causes ‘atypical’ pneumonia in older children and young adults (Himmelreich et al., 1996). This sequence is selected here because we have observed that it failed a homogeneity test (Oliver and Li, 1998), even though bacterial sequences are usually considered to be homogeneous. When recursive segmentation is applied to the M. pneumoniae sequence, the number of segmented domains is 109, 58, 22, and nine, respectively, when s0 is set at 0, 0.5, 2, and 5. Fig. 3 shows the segmentation result at s0 ¼ 2 (see also Table 3). The striking feature of Fig. 3 is that 12 out of the total number of 22 domains at this segmentation strength level are small (a few kilobases) and have much higher (.50%) or lower (,35%) G 1 C content than the genome-wide value. The larger domains usually exhibit an average G 1 C content (around 40%). Small domains with extreme (G 1 C)% may well be external elements, repetitive sequences, or horizontally transferred segments.
It is clear that these unusually high or low (G 1 C)% domains with a few kilobases have a direct implication for a homogeneity study. First of all, they may be the reason for the sequence to fail a homogeneity test (Oliver and Li, 1998). In this test (‘two-level segmentation test’), smaller domains segmented at a relatively more relaxed criterion are considered to be members of a larger domain (group), and whether groups are different (homogeneous) is decided by comparing the variance within a group and the variance between groups. The existence of several nearby extremevalue kilobase-sized domains will make that group different from other groups, thus failing the homogeneity test. Secondly, when smaller domains are removed, if strong segmentations (those with large s values) only occur between a large domain with an average (G 1 C)% (40%) and a small domain with an extreme (G 1 C)% (either 35 or 50%), the sequence will become homogeneous. In Fig. 3, for example, no large domains are next to each other except the two at the right end (113 and 74 kb, respectively). Since the strength of this segmentation is s ¼ 5:2, we expect that after smaller domains are removed, the level of homogeneity of M. pneumoniae is perhaps similar to yeast chromosome III (and slightly lower). In Ha¨ ring and Kypr (2001), it was claimed that the degree of heterogeneity of bacterium E. coli is comparable to that of the human sequence. We caution that the level of heterogeneity in a bacterium sequence can be greatly reduced if the small elements with high/low G 1 C content are first removed. 3.4. Human major histocompatibility complex The human major histocompatibility complex (MHC)
Fig. 2. Similar to Fig. 1, but for yeast chromosome IV. Sequence length N ¼ 1; 531; 974 bases; strength threshold s0 is set to 0; there are 13 domains. A similar result using a more stringent stopping criterion (s0 ¼ 1) is shown in Table 2.
W. Li / Gene 276 (2001) 57–72
sequence is located at chromosome 6p and contains more than 200 genes, many of them with immune system functions (Beck et al., 1999). The human MHC was discovered when certain alloantigens called human leukocyte antigens (HLA) were genetically mapped to that region. For this reason, the human MHC sequence is also called the HLA region. MHC is a sequence whose isochore structure has been studied in detail (Fukagawa et al., 1995, 1996; Stephens et al., 1999). Among the four regions identified, (telomere) classical class I, class III, classical class II, extended class II (centromere) (Beck et al., 1999), two regions are well characterized isochores (III and classical II) (Fukagawa et al., 1995). When recursive segmentation is applied to the MHC sequence, the number of domains is 52, 25, and 11, respectively, with s0 set at 5, 10, and 20. To reproduce the known four domains in Beck et al. (1999), the s0 has to be set to approximately 55. Fig. 4 and Table 4 illustrate the segmentation results at s0 ¼ 10 and s0 ¼ 20, respectively. The class III isochore (642 kb) is clearly the most homogeneous domain among the four, and the class II isochore (901 kb) is the second most homogeneous domain. There are plenty of sub-domains in the class I region (1.84 Mb): when recursive segmentation is applied, two segmentations have a strength around 50, and three segmentations have a strength around 20. Table 4 shows that at s0 ¼ 20, the class I region consists of five large domains (sizes 299, 534, 658, 249, and 99 kb, with (G 1 C)% at 46.1, 43.2, 48.5, 43.1, and 47.1%). There is also a smaller domain (size 3 kb) with 70.1% of (G 1 C). Recall that the strongest segmentation in yeast chromosome III has a strength around 8.9; thus, the class I domain is more heterogeneous than the whole yeast chromosome III.
63
After viewing Fig. 4, it is not surprising that the border between class III and class II at position 2,483,966 is the best studied isochore border (Fukagawa et al., 1995, 1996). Not only is there a huge drop of (G 1 C)% (from 51.9 to 41.1%), but also the lengths of the relative homogeneous domains on both sides of the border are long. If s0 is reduced to 5, the class III isochore can only be segmented three more times: the first cut is around the two-third portion with strength 6.2, and the remaining two cuts identify much smaller regions (5.8 and 34.6 kb, respectively, with strengths of s ¼ 19:5 and 24.5). When these smaller regions are discarded, the class III domain in the MHC sequence is as homogeneous as yeast chromosome III (but less homogeneous than yeast chromosome IV). 3.5. Longest contigs in human chromosomes 21 and 22 To compare the result obtained in Nekrutenko and Li (2000), we have segmented the longest contig in human chromosome 21 (Hattori et al., 2000) and that of chromosome 22 (Dunham et al., 1999): ch21 contig NT002836 (28.515 Mb or 85% of the sequenced chromosome) and ch22 contig NT001454 (22.964 Mb or 68.8% of the sequenced genome). The number of segmented domains in the chromosome 21 contig is 126, 62, 18, and nine, respectively, with s0 set at 5, 10, 20, and 50. Fig. 5 shows the results of segmentation at s0 ¼ 10, 20, and Table 5 shows results at s0 ¼ 50. The most dominant segmentation is at position 17,883,014 with strength s ¼ 3381. This segmentation effectively separates the contig into the (G 1 C)-poor left sequence and the (G 1 C)-rich right sequence. Nevertheless, this segmenta-
Fig. 3. Similar to Fig. 1, but for bacterium sequence of M. pneumoniae. Sequence length N ¼ 816; 394 bases; strength threshold s0 is set to 2; there are 22 domains shown. A similar result using a more stringent stopping criterion (s0 ¼ 5) is shown in Table 3.
64
W. Li / Gene 276 (2001) 57–72
Fig. 4. Similar to Figs. 1 and 2, but for human MHC sequence. Sequence length N ¼ 3; 673; 778 bases; strength threshold s0 is set to 10; there are 25 domains. A similar result using a more stringent stopping criterion (s0 ¼ 50) is shown in Table 4. The four known isochores (telomere is on the left side and centromere is on the right side), class I, class III, class II, and extended class II, as well as the isochore border between class III and class II are marked in the plot.
tion does not tell the whole story of the base composition variation in the sequence: the left sequence can be segmented further into domains with an even lower G 1 C content (a 7.1 Mb domain with (G 1 C)% of 35.1%). The human chromosome 21 contig illustrates two themes we would like to emphasize: (1) the domains-within-domain phenomenon and (2) the failure of the moving-window approach to identify smaller domains. Besides the 7.1 Mb domain on the (G 1 C)-poor side of the first segmentation with an even lower (G 1 C)%, more complicated subdomain structures emerge as the stopping criterion is relaxed from s0 ¼ 20 to s0 ¼ 10. There are already 15 sub-domains in the first 1 Mb region (from position 676,168 to 1,161,544), some of which have extremely high (G 1 C)% (see lower plot of Fig. 5). These smaller domains (2.5, 1, 8, 10.9, and 31.3 kb, with (G 1 C)% of 65.8, 69.5, 59.9, 61, and 52.9%) would not be detected with our moving window at the 800 kb size. There were not many annotations concerning (G 1 C)% domains in the human chromosome 21 paper (Hattori et al., 2000). However, those that were mentioned in Hattori et al. (2000) are confirmed by our recursive segmentation. For example, the long (G 1 C)-poor domain (7.1 Mb at s0 ¼ 20 or 5.4 Mb at s0 ¼ 10) which lacks both genes and Alu elements (Hattori et al., 2000) can be seen clearly in Fig. 5. The (G 1 C)-rich 1 Mb region which is rich in Alu elements and pseudogenes (Hattori et al., 2000) can also be seen in Fig. 5 (lower plot). The number of segmented domains in the chromosome 22 contig is 322, 164, 80, 28, and ten, respectively, with the
stopping criterion s0 set at 5, 10, 20, 50, and 100. These numbers are larger than those obtained at the chromosome 21 contig at a corresponding strength level, indicating that chromosome 22 is more heterogeneous than the chromosome 21 sequence. The fact that (G 1 C)-rich chromosome 22 is more heterogeneous than (G 1 C)-poor chromosome 21 is consistent with the general trend connecting heterogeneity and G 1 C content. For random sequences, it was known that the closer (G 1 C)% is to 50%, the more heterogeneous the sequence becomes (Sueoka, 1962). This predicted effect was observed experimentally (Cuny et al., 1981; Bernardi, 1989), though the absolute value of heterogeneity is much stronger than for a random sequence. (Note that the same observation was rediscovered in Nekrutenko and Li (2000).) The domains-within-domain phenomenon is evident in Fig. 6. At a very crude level (high s0 values), the chromosome 22 contig consists of four to five large domains. The base composition, however, is clearly not homogeneous in each domain. At the s0 ¼ 50 level, there are no segmented domains larger than 2.5 Mb (ten segmented domains larger than 1 Mb, see upper plot of Fig. 6). At the s0 ¼ 20 level, only five segmented domains are larger than around 1 Mb. The sequence between positions 12,632,949 and 15,543,635 was mentioned in Dunham et al. (1999) as (G 1 C)-poor, Alu-poor, and gene-poor (only three genes). At s0 ¼ 20, there are seven sub-domains in the region of sizes 192, 706, 984, 356, 298, 93, and 281 kb, with (G 1 C)% at 40.8, 44.5, 41.1, 45, 51.3, 44, and 39.1%. In Ha¨ ring and Kypr (2001), whether isochores exist in
W. Li / Gene 276 (2001) 57–72
65
Fig. 5. Segmented domains of the longest contig in human chromosome 21 (NT002836). The sequence length is 28.515 Mb. The upper plot contains 18 domains segmented at s0 ¼ 20; the lower plot contains 62 domains segmented at s0 ¼ 10. The (G 1 C)% in the moving window is obtained by using a window length of 800 kb, and a moving distance of 80 kb. Smaller (G 1 C)-rich domains in the first 1 Mb not segmented at the s0 ¼ 20 level (upper plot) can be seen with a more relaxed stopping criterion s0 ¼ 10 (lower plot). The (G 1 C)% range of five types of isochores (L1, L2, H1, H2, and H3) is listed on the left side as a reference.
human chromosome 21 and 22 was debated. This question can be answered here definitely if one sets a fixed segmentation threshold s0 and an allowed minimal size for an isochore. If we set a minimum isochore size of 200 kb as suggested by compositional mapping data (Bernardi, 1995), then at s0 ¼ 10 (about the same level of heterogeneity as yeast chromosome III), there are 32 segmented domains in chromosome 21 (the maximum size is 5.44 Mb), and 30 such domains in chromosome 22 (the maximum size is 1.69 Mb). At s0 ¼ 5
(about the same level of heterogeneity as M. pneumoniae, if the short, extreme G 1 C content domains are removed), there are 39 200 kb-plus domains in human chromosome 21, and 31 such domains in chromosome 22 (the maximum sizes are 4.77 Mb and 825 kb, respectively). Clearly, some large domains obtained at s0 ¼ 10 are split into smaller domains at s0 ¼ 5, but there are still several 200 kb-plus domains even for chromosome 22. The difference between chromosome 21 and 22 is also very clear at
66
W. Li / Gene 276 (2001) 57–72
s0 ¼ 10: the maximum segmented domain size in chromosome 22 is 825 kb, which is much smaller than that for chromosome 21. In this sense, chromosome 22 is more heterogeneous than chromosome 21. It is difficult in Figs. 5 and 6 to examine domains with a size around 200 kb because they are at a smaller scale as compared to the chromosome length. Of course, if one is interested in a comparison to the yeast chromosome IV (homogeneous at s0 < 2–4) or a random sequence (homogeneous at s0 ¼ 0), the conclusion can be different.
4. Discussion 4.1. The advantage of window-less domain analysis over window-based analysis A moving-window approach is used in almost all analyses of compositional properties of genome sequences. One chooses a window size and a window moving distance and plots the (G 1 C)% in the window moving along the sequence. The choice of the window size reflects the interest
Fig. 6. Segmented domains of the longest contig in human chromosome 22 (NT001454). The sequence length is 22.964 Mb. The upper plot contains 28 domains segmented at s0 ¼ 50; the lower plot contains 80 domains segmented at s0 ¼ 20. The (G 1 C)% in the moving window is obtained by using a window length of 800 kb, and a moving distance of 80 kb. The (G 1 C)% range of five types of isochores (L1, L2, H1, H2, and H3) is listed on the left side as a reference.
W. Li / Gene 276 (2001) 57–72
of the researcher: if he/she is interested in long(short)-range compositional fluctuation, a large(small) window size is selected. Features presented at one choice of window size may disappear in another choice of size (Li et al., 1994). The choice of moving distance may be less of a problem: moving distances of about 1/100 of the window size are often used, or simply of one base. Recursive segmentation is a window-less approach: no choice of window size needs to be made. It finds significant segmentation regardless of the length of the left and right subsequences. We can see from Figs. 1–6 that domains of all sizes can be determined at the same time, as long as all segmentations achieve at least a certain strength level. The lower plot of Fig. 5 provides a striking example: domains as short as 1 kb co-exist with a 5.4 Mb-long domain. A window-based approach will be very unlikely to reveal short and long domains at the same time. Recursive segmentation can be potentially better than a window-based approach in determining the domain borders accurately. Unless the moving distance in a moving-window approach is one base, there is always a gray area determined by that distance. Even if the moving distance is a single base, the choice of the window size determines how far away from the border the two flanking sequences to be included in an averaging are. Recursive segmentation essentially uses flanking sequences around a border as far away as possible. By doing so, it may reinforce the signal at the border if two flanking sequences are homogeneous themselves (although it may affect the accuracy if the flanking sequences are not homogeneous).
Fig. 7. A comparison of the MHC, human chromosome 21, and human chromosome 22 sequences by plotting the average domain size (sequence length divided by the number of segmented domains) vs. the segmentation stopping criterion s0.
67
Fig. 8. Another comparison of the MHC, human chromosome 21, and human chromosome 22 sequences by plotting the ranked segmentation strength s (threshold s0 is set at 5) vs. the rank divided by the sequence length.
4.2. The relative nature of homogeneity and the domainswithin-domain phenomenon Figs. 1–6 and Tables 1–6 show that the number of homogeneous G 1 C domains in a sequence is a function of the chosen criterion for continuing/stopping a recursive segmentation. Although we use a specific stopping criterion s0 based on the BIC, the conclusion should be true for other measurements or criteria. For the six sequences analyzed here, yeast chromosome IV can be considered to be homogeneous at s0 around 1.4 (if the right telomere is removed from the sequence), yeast chromosome III sequence is homogeneous at s0 ¼ 6–9, a 7 Mb stretch of the (G 1 C)poor region of human chromosome 21 can be considered homogeneous at s0 ¼ 20, the class III domain of MHC sequence is homogeneous at s0 ¼ 25, and the class I domain in the same sequence is considered to be homogeneous at s0 ¼ 55. The whole MHC sequence and the two longest contigs in human chromosomes 21 and 22, on the other hand, are not usually considered to be homogeneous, because the criteria for them to be homogeneous are too low (or equivalently, stopping criteria for segmentation are too stringent). Can we determine which sequence is more heterogeneous than other sequences? This question can be partially answered by an examination of the number of segmentations (and domains) as a function of the segmentation strength s0, or a spectrum of the segmentation strength at a fixed strength value. Fig. 7 shows the average size of the segmented domains as a function of the segmentation criterion s0: the smaller the domain size, the more heterogeneous the sequence is. By this stan-
68
W. Li / Gene 276 (2001) 57–72
Table 1 Seven domains segmented at s0 ¼ 3 for yeast chromosome III sequence (yeast ch3, N ¼ 0:3 Mb, s0 ¼ 3, seven domains) a Start (bp)
1 620 45,863 108,359 206,888 214,874 270,931
End (bp)
619 45,862 108,358 206,887 214,873 270,930 315,341
Length (bp)
619 45,243 62,496 98,529 7986 56,057 44,411
Segmentation at end 2log(L2/L1)
Strength
99.62 119.25 164.91 248.73 137.16 108.13 –
3.64 4.14 5.74 8.94 5.20 3.27 –
(G 1 C)%
Comments
57.2 37.2 40.8 36.7 47.7 40.7 36.3
Left telomere Left (G 1 C)-rich peak
Right (G 1 C)-rich peak
a
The following information is included: starting and ending position of the domain; domain size; 2log of the likelihood ratio at the ending position (L2 is the likelihood for the two-subsequence model, and L1 is the likelihood for the one-sequence model); BIC-based segmentation strength at the ending position; G 1 C content in the segmented domain; and comments.
dard, the MHC sequence is similar to the chromosome 22 sequence, whereas both are more heterogeneous than the chromosome 21 sequence. (Note that when s0 approaches 5, most domains become small, and the differences between chromosomes 21 and 22 tend to disappear.) Fig. 8 shows the strength s of all segmentation points, ranked from high to low, when the stopping criterion is s0 ¼ 5. The ‘spectrum’ plot (also called rank-strength plot) in Fig. 8 is slightly different from the standard plot in that there is an attempt to remove the sequence length effect. Other things being equal, longer sequences will experience a greater number of segmentations than shorter sequences, even when the stopping criterion is set at the same value. We divide the rank by the sequence length, assuming that the number of segmentations is proportional to the sequence length. It can be seen that besides one strong segmentation, other segmentations in chromosome 21 are much weaker than those in chromosome 22 and the MHC sequence. Both Figs. 7 and 8 indicate that the human chromosome 21 sequence is less heterogeneous than the human chromosome 22 sequence and the MHC sequence. Figs. 7 and 8 are not the only plots that can be used to compare the sequence heterogeneity of different sequences. The only information used in Fig. 7 is the number of segmented domains, and no information is given on the base composition in each domain. In Roma´ n-Rolda´ n et al. (1998) a quantity called compositional complexity was used which utilizes not only the number of segmented domains, but also differences between base compositions in these
domains. When this compositional complexity is calculated at different significance levels of segmentation, an overall picture of the sequence complexity at different length scales can be viewed and compared. It should be interesting to compare Figs. 7 and 8 with such a sequence complexity profile plot. 4.3. On the existence of isochores Recently, several publications questioned the existence of isochores. One could find sentences such as: “genomes of multicellular organisms are much more heterogeneous in nucleotide composition than depicted by the isochore model” (Nekrutenko and Li, 2000), “no isochores in the human chromosomes 21 and 22?” (Ha¨ ring and Kypr, 2001), and “isochores do not appear to merit the prefix ‘iso’” (Lander et al., 2001). Since all these publications used a moving-window approach in their analysis, they may not be able to delineate the borders of relative homogeneous domains accurately before carrying out a homogeneity test. In our recursive segmentation, an answer to the existence of isochores can be tested easily, once a minimum domain size and a segmentation stopping criterion s0 are specified. For example, one may specify s0 ¼ 0 and a minimum size of 300 kb. This is somehow equivalent to comparing the level of homogeneity of any 300 kb sequence with a random sequence. Another possibility is to specify s0 ¼ 1:5 and a minimum size of 300 kb, which is a comparison of the level
Table 2 Three domains segmented at s0 ¼ 1 for yeast chromosome IV sequence (yeast ch4, N ¼ 1:5 Mb, s0 ¼ 1, three domains) a Start (bp)
1 242,132 1,524,875 a
End (bp)
242,131 1,524,874 1,531,974
Length (bp)
242,131 1,282,743 7100
The columns are the same as in Table 1.
Segmentation at end 2log(L2/L1)
Strength
65.94 125.73 –
1.32 3.41 –
(G 1 C)%
Comments
38.6 37.7 0.44
Right telomere
W. Li / Gene 276 (2001) 57–72
69
Table 3 Nine domains segmented at s0 ¼ 5 for bacterium M. pneumoniae sequence (M. pneumoniae, N ¼ 816 kb, s0 ¼ 5, nine domains) a Start (bp)
1 14,605 26,587 241,106 256,467 410,538 413,699 472,184 742,080 a
End (bp)
14,604 26,586 241,105 256,466 410,537 413,698 472,183 742,079 816,394
Length (bp)
14,604 11,982 214,519 15,361 154,071 3161 58,485 269,896 74,315
(G 1 C)%}
Segmentation at end 2log(L2/L1)
Strength
453.24 202.06 217.19 251.81 195.36 133.19 228.65 158.15 –
21.24 6.42 7.45 9.46 6.19 5.04 7.85 5.20 –
38.4 51.4 40.1 31.9 38.4 53.3 42.8 39.6 42.1
Comments
Putative Ori at 205 kb
The columns are as in Table 1. A proposed replication origin is located at the right end of the first large domain.
of homogeneity of a 300 kb sequence with the yeast chromosome IV sequence. A relaxed condition for s0 ¼ 5 is roughly equivalent to testing the sequence by using the yeast chromosome III sequence as the standard. As can be seen in Section 3.5, many such relative homogeneous domains exist in the human chromosome 21 and 22 sequences, so without question, isochores do exist in the human genome. 4.4. Some caveats of the segmentation procedure Some technical details of recursive segmentation have not been investigated thoroughly here, and they could be future research topics. First of all, our recursion is based on a repeated application of 1-to-2 segmentation, whereas for some sequence structures recursion of 1-to-3 or 1-to-many segmentations may be more appropriate. Even if we continue to use 1-to-2 segmentations, we have to be aware of some consequences of using the simpler approach. For example, if the sequence consists of three domains (G 1 C high, low, and high), it is typical that the first round of 1-to-2 segmentation will detect one domain border, and
the second round detects the second domain border. These detected borders, however, may not be accurate. The reason is that during the first round of recursion, two out of the three domains are combined to form one flanking sequence on one side of the segmentation point, and the third domain is the flanking sequence on the other side. The base composition contrast between the two flanking sequences is not as strong as that between two domains, and this will make border detection less accurate. This was observed in Li et al. (2001) in detection of the replication origins and replication terminus of bacterial genomes, when one replicon is split into two pieces by an arbitrarily chosen sequence starting point, and another replicon stays intact. The border determined by segmentation can be shifted by 1% of the total sequence length (Li et al., 2001). Secondly, a 1-to-2 segmentation is sometimes significant only within the subsequence in which it is carried out. It is not always clear whether a segmented domain from this subsequence is significantly different from its neighboring domain segmented from another subsequence. Or similarly, consider the branching binary tree example. Two leaves from the same branch are indeed significantly separated,
Table 4 Eleven domains segmented at s0 ¼ 20 for human MHC sequence (MHC, N ¼ 3:6 Mb, s0 ¼ 20, 11 domains) a Start (bp)
1 299,271 833,240 1,490,847 1,739,421 1,742,438 1,841,872 2,483,967 3,384,908 3,444,781 3,638,111 a
End (bp)
299,270 833,239 1,490,846 1,739,420 1,742,437 1,841,871 2,483,966 3,384,907 3,444,780 3,638,110 3,673,778
Length (bp)
299,270 533,969 657,607 248,574 3017 99,434 642,095 900,941 59,873 193,330 35,668
(G 1 C)%
Segmentation at end 2log(L2/L1)
Strength
632.35 1478.17 1547.96 640.76 634.85 7003.27 5195.53 8099.54 826.19 535.61 –
22.19 50.23 54.99 24.09 26.51 236.80 170.85 288.49 31.85 20.70 –
46.1 43.2 48.5 43.1 70.1 47.1 51.9 41.1 55.8 48.2 54.9
The columns are the same as in Table 1. The domains in different isochores (I, III, II, and extended II) are separated.
Comments
Class III Class II
70
W. Li / Gene 276 (2001) 57–72
Table 5 Nine domains in human chromosome 21 contig (NT002836) segmented at s0 ¼ 50 (human ch21 contig, N ¼ 28 Mb, s0 ¼ 50, nine domains) a Start (bp)
1 676,169 1,161,545 4,308,021 5,205,139 12,309,164 17,883,015 27,159,470 28,052,198 a
End (bp)
676,168 1,161,544 4,308,020 5,205,138 12,309,163 17,883,014 27,159,469 28,052,197 28,515,322
Length (bp)
676,168 485,376 3,146,476 897,118 7,104,025 5,573,851 9,276,455 892,728 463,125
Segmentation at end 2log(L2/L1)
Strength
3896.97 4859.79 2743.39 4867.75 6287.83 116,122.46 2189.21 2269.64 –
138.52 144.51 89.17 148.99 188.03 3381.35 66.84 69.14 –
(G 1 C)%
Comments
37.3 43.1 36.5 39.6 35.1 37.8 43.2 40.7 46.6
Gene-poor, Alu-poor Gene-rich, Alu-rich
See also Fig. 5 for the segmentation results at s0 ¼ 20 and s0 ¼ 10.
but it is not clear whether two leaves from two neighboring branches are significantly different. In the field of recursive partitioning and tree-based methods (e.g. Breiman et al., 1984), a branch-merging process could also be carried out. In our segmentation example, such a process corresponds to a re-visit of all segmentation points to see whether their removal will increase BIC. This merging technique has not been applied in this paper. Thirdly, in using the BIC defined in Eq. (6), the number of parameters for the two-subsequence model (K2) is assumed to be seven (or three for binary sequences), with the partition point included as a parameter. In our discussion on the null distribution of LLR, it was mentioned that the partition point is not a typical parameter: not only is it discrete, but also its range increases with the sample size. This fact causes problems in obtaining an analytic expression for the null distribution. It is not known whether we should treat the segmentation point in the same way as base compositions when counting the number of parameters for the application of BIC (there are arguments suggesting that the partition point represents more than one parameter (D. Siegmund, pers. commun.)). Fourthly, besides the possible problem with the number
of parameters in the model, another problem may exist for counting the number of samples. Currently, it is assumed that each nucleotide is an independent sample of a multinomial distribution, and the likelihood is a product of the probabilities of the individual observations. If the samples are not independent, the joint probability is not equal to the product of probabilities of individual observations. Nucleotides in a DNA sequence are typically correlated, but to complicate the matter, the degree of correlation may change from region to region. To model correlations in the likelihood function requires more parameters. It is not clear whether the increase of the maximum likelihood can be compensated by the larger number of parameters.
5. Conclusion Although there have been attempts to modify the movingwindow approach to capture features in DNA sequences with a relatively complicated base composition structure (for example, the hidden Markov model (Churchill, 1989) and the moving Markov model (Fickett et al., 1992)), top-tobottom, hierarchical modeling is always more powerful than
Table 6 Ten domains in human chromosome 22 contig (NT001454) segmented at s0 ¼ 100 (human ch22 contig, N ¼ 23 Mb, s0 ¼ 100, ten domains) a Start (bp)
1 5,062,627 7,044,536 7,458,958 8,285,257 8,742,075 12,632,949 15,771,263 19,223,290 20,686,171 a
End (bp)
5,062,626 7,044,535 7,458,957 8,285,256 8,742,074 12,632,948 15,771,262 19,223,289 20,686,170 22,963,602
Length (bp)
5,062,626 1,981,909 414,422 826,299 456,818 3,890,874 3,138,314 3,452,027 1,462,881 2,277,432
(G 1 C)%
Segmentation at end 2log(L2/L1)
Strength
18,070.82 3222.96 5980.66 3499.64 5060.51 4247.90 25,316.78 8897.47 18,883.74 –
544.17 108.70 196.80 123.41 158.74 130.22 745.83 280.77 622.86 –
See also Fig. 6 for the segmentation results at s0 ¼ 50 and s0 ¼ 20.
48.7 45.0 49.9 39.8 45.1 47.0 43.6 51.7 43.7 51.0
Comments
Gene-poor, Alu-poor
W. Li / Gene 276 (2001) 57–72
left-to-right, sequential modeling (Li, 1992), as proven in formal language theory (context-free and context-sensitive language are examples of top-to-bottom grammars, whereas regular language is an example of left-to-right grammar (Hopcroft and Ullman, 1979)). We believe that the recursive segmentation discussed here is more suitable for the study of complex compositional heterogeneity and the isochore/ domain structure in DNA sequences. Acknowledgements I would like to thank Oliver Clay, Jose´ Oliver, Pedro Bernaola-Galva´ n, Ivo Grosse, Yaning Yang, and David Siegmund for discussions, Fatemeh Haghighi for implementing the segmentation program, and Katherine Montague for proofreading the paper. This work is supported by the NIH grant K01HG00024. References Beck, S., Geraghty, D., Inoko, H., Rowen, L., et al., 1999. The MHC Sequencing Consortium. Complete sequence and gene map of a human major histocompatibility complex. Nature 401, 921–923. Berger, J.O., Berry, D.A., 1988. Analyzing data: is objectivity possible? Am. Sci. 76, 159–165. Bernaola-Galva´ n, P., Roma´ n-Rolda´ n, R., Oliver, J.L., 1996. Compositional segmentation and long-range fractal correlations in DNA sequences. Phys. Rev. E 53, 5181–5189. Bernaola-Galva´ n, P., Grosse, I., Carpena, P., Oliver, J.L., Roma´ n-Rolda´ n, R., Stanley, H.E., 2000. Finding borders between coding and noncoding DNA regions by an entropic segmentation method. Phys. Rev. Lett. 85, 1342–1345. Bernardi, G., 1989. The isochore organization of the human genome. Annu. Rev. Genet. 23, 637–661. Bernardi, G., 1995. The human genome: organization and evolutionary history. Annu. Rev. Genet. 23, 637–661. Bradnam, K.R., Seoighe, C., Sharp, P.M., Wolfe, K.H., 1999. G 1 C content variation along and among Saccharomyces cerevisiae chromosomes. Mol. Biol. Evol. 16, 666–675. Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J., 1984. Classification and Regression Trees, Wadsworth, Pacific Grove, CA. Burnham, K.P., Anderson, D.R., 1998. Model Selection and Inference: A Practical Information-Theoretic Approach. Springer, New York. Cormen, T.H., Leiserson, C.E., Rivest, R.L., 1990. Introduction to Algorithms, MIT Press, Cambridge, MA. Churchill, G.A., 1989. Stochastic models for heterogeneous DNA sequences. Bull. Math. Biol. 51, 79–94. Cox, D.R., Hinkley, D.V., 1974. Theoretical Statistics. Chapman and Hall, London. Cuny, G., Soriano, P., Macaya, G., Bernardi, G., 1981. The major components of the mouse and human genomes. I. Preparation, basic properties and compositional heterogeneity. Eur. J. Biochem. 115, 227–233. Dujon, B., 1996. The yeast genome project: what did we learn? Trends Genet. 12, 263–270. Dunham, I., Shimizu, N., Chissoe, S., et al., 1999. The DNA sequence of human chromosome 22. Nature 402, 489–495. Fickett, J.W., Torney, D.C., Wolf, D.R., 1992. Base compositional structure of genomes. Genomics 13, 1056–1064. Fukagawa, T., Sugaya, K., Matsumoto, K., Okumura, K., Ando, A., Inoko, H., Ikemura, T., 1995. A boundary of long-range G 1 C% mosaic domains in the human MHC locus: pseudoautosomal boundary-like sequence exists near the boundary. Genomics 25, 184–191.
71
Fukagawa, T., Nakamura, Y., Okumura, K., Nogami, M., Ando, A., Inoko, H., Saito, N., Ikemura, T., 1996. Human pseudoautosomal boundarylike sequences: expression and involvement in evolutionary formation of the present-day pseudoautosomal boundary of human sex chromosomes. Hum. Mol. Genet. 5, 23–32. Grosse, I., Bernaola-Galva´ n, P., Carpena, P., Roma´ n-Rolda´ n, R., Oliver, J., Stanley, H.E., 2001. Analysis of symbolic sequences using the Jensen– Shannon divergence measure. Phys. Rev. E, to appear. Ha¨ ring, D., Kypr, J., 2001. No isochores in the human chromosomes 21 and 22? Biochem. Biophys. Res. Commun. 280 (2), 567–573. Hattori, M., Taudien, S., Kudoh, J., Nordsiek, G., Ramser, J., et al., 2000. The Chromosome 21 Mapping and Sequencing Consortium. The DNA sequence of human chromosome 21. Nature 405, 311–319. Himmelreich, R., Hilbert, H., Plagens, H., Pirkl, E., Li, B.C., Herrmann, R., 1996. Complete sequence analysis of the genome of the bacterium Mycoplasma pneumoniae. Nucleic Acids Res. 24, 4420–4449. Hopcroft, J.E., Ullman, J.D., 1979. Introduction to Automata Theory, Languages, and Computation, Addison-Wesley, Reading, MA. Jacq, C., et al., 1997. The nucleotide sequence of Saccharomyces cerevisiae chromosome IV. Nature 387 (Suppl.), 75–78. James, B., James, K.L., Siegmund, D., 1987. Tests for a change-point. Biometrika 74, 71–83. Lander, E.S., Waterston, R.H., Sulston, J., Collins, F.S., et al., 2001. International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome. Nature 409, 860–921. Lehmann, E.H., 1991. Testing Statistical Hypotheses. Wiley, New York. Li, W., 1992. Generating nontrivial long-range correlations and 1/f spectra by replication and mutation. Int. J. Bifurcation Chaos 2, 137–154. Li, W., 1997a. The complexity of DNA: the measure of compositional heterogeneity in DNA sequences and measures of complexity. Complexity 3, 33–37. Li, W., 1997b. The study of correlation structures of DNA sequences – a critical review. Comput. Chem. 21, 257–271 special issue on open problems of computational molecular biology. Li, W., 2001a. DNA segmentation as a model selection process. RECOMB 2001: Proceedings of the Fifth Annual International Conference of Computational Molecular Biology. ACM Press, New York, pp. 210– 216. Li, W., 2001b. New stopping criteria for segmenting DNA sequences. Phys. Rev. Lett. 86, 5815–5818. Li, W., Kaneko, K., 1992. Long-range correlation and partial 1/f spectrum in a noncoding DNA sequence. Europhys. Lett. 17, 655–660. Li, W., Marr, T., Kaneko, K., 1994. Understanding long-range correlations in DNA sequences. Phys. D 75, 392–416. Li, W., Stolovitzky, G., Bernaola-Galva´ n, P., Oliver, J.L., 1998. Compositional heterogeneity within, and uniformity between, DNA sequences of yeast chromosomes. Genome Res. 8, 916–928. Li, W., Bernaola-Galva´ n, P., Haghighi, F., Grosse, I., 2001. Applications of recursive segmentation to the analysis of DNA sequences. Comput. Chem., to appear. Lin, J., 1991. Divergence measures based on the Shannon entropy. IEEE Trans. Info. Theory 37, 145–151. Nekrutenko, A., Li, W.H., 2000. Assessment of compositional heterogeneity within and between eukaryotic genomes. Genome Res. 10, 1986– 1995. Oliver, J.L., Li, W., 1998. Quantitative analysis of compositional heterogeneity in long DNA sequences: the two-level segmentation test (meeting abstract). Genome Mapping, Sequencing and Biology. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, NY, p. 163. Oliver, J.L., Roma´ n-Rolda´ n, R., Perez, J., Bernaola-Galva´ n, P., 1999. SEGMENT: identifying compositional domains in DNA sequences. Bioinformatics 15, 974–979. Oliver, S.G., et al., 1992. The complete DNA sequence of yeast chromosome III. Nature 357, 38–46. Peng, C.K., Buldyrev, S.V., Goldberger, A.L., Havlin, S., Sciortino, F., Simon, M., Stanley, H.E., 1992. Long-range correlations in nucleotide sequences. Nature 356, 168–170.
72
W. Li / Gene 276 (2001) 57–72
Pettitt, A.N., 1980. A simple cumulative sum type statistic for the changepoint problem with zero-one observations. Biometrika 67, 79–84. Roma´ n-Rolda´ n, R., Bernaola-Galva´ n, P., Oliver, J.L., 1998. Sequence compositional complexity of DNA through an entropic segmentation method. Phys. Rev. Lett. 80, 1344–1347. Royall, R.M., 1997. Statistical Evidence: A Likelihood Paradigm. Chapman and Hall, London. Sharp, P.M., Lloyd, A.T., 1993. Regional base composition variation along yeast chromosome III: evolution of chromosome primary structure. Nucleic Acids Res. 21, 179–183.
Stephens, R., Horton, R., Humphray, S., Rowen, L., Trwosdale, J., Beck, S., 1999. Gene organisation, sequence variation and isochore structure at the centrometric boundary of the human MHC. J. Mol. Biol. 291, 789– 799. Sueoka, N., 1962. On the genetic basis of variation and heterogeneity of DNA base composition. Proc. Natl. Acad. Sci. USA 48, 582–592. Venter, J.C., et al., 2001. The sequence of the human genome. Science 291, 1304–1351. Voss, R., 1992. Evolution of long-range fractal correlations and 1/f noise in DNA base sequences. Phys. Rev. Lett. 68, 3805–3808.