
Gene 376 (2006) 268 – 280 www.elsevier.com/locate/gene

Characterization of frequencies and distribution of single nucleotide insertions/deletions in the human genome

Ene-Choo Tan a,c,⁎, Haixia Li b

a KK Research Centre, KK Women's and Children's Hospital, 100 Bukit Timah Road, Singapore 229899, Singapore
b Bioinformatics Institute, National University of Singapore, Singapore
c Department of Psychological Medicine, Yong Loo Lin School of Medicine, National University of Singapore, Singapore

Received 9 May 2005; received in revised form 7 April 2006; accepted 12 April 2006
Available online 3 May 2006
Received by A. Bernardi

Abstract

Most of the studies on single nucleotide variations are on substitutions rather than insertions/deletions. In this study, we examined the distribution and characteristics of single nucleotide insertions/deletions (SNindels), using data available from dbSNP for all the human chromosomes. There are almost 300,000 SNindels in the database, of which only 0.8% are validated. They occur at a frequency of 0.887 per 10 kb on average for the whole genome, or approximately 1 for every 11,274 bp. More than half occur in regions with mononucleotide repeats, the longest of which is 47 bases. Overall, the mononucleotide repeats involving C and G are much shorter than those for A and T. About 12% are surrounded by palindromes. There is a general correlation between chromosome size and the total number of SNindels for each chromosome. Inter-chromosomal variation in density ranges from 0.6 to 21.7 per 100 kb. The overall spectrum shows a very high proportion of SNindels of types –/A and –/T, at over 81%. The proportion of –/A and –/T SNindels for each chromosome is correlated with its AT content. Less than half of the SNindels are within or near known genes and even fewer (< 0.183%) are in coding regions; more than 1.4% of –/C and –/G SNindels are in coding regions compared to 0.2% for the –/A and –/T types. SNindels of –/A and –/T types make up 80% of those found within untranslated regions but less than 40% of those within coding regions. A separate analysis using the subset of 2324 validated SNindels showed a slightly lower AT bias of 74%, and SNindels not within mononucleotide repeats showed an even lower AT bias of 58%. The density of validated SNindels is 0.007/10 kb overall and 90% are found within or near genes. Among all chromosomes, Y has the lowest numbers and densities for all SNindels, validated SNindels, and SNindels not within repeats.

© 2006 Elsevier B.V. All rights reserved.

Keywords: dbSNP; Insertions/deletions; Density; Mononucleotide repeats; Palindromes

Abbreviations: dbSNP, Single Nucleotide Polymorphism database; LINE, Long interspersed element; NCBI, National Center for Biotechnology Information; SNindel, Single nucleotide insertion/deletion; SINE, Short interspersed element.

⁎ Corresponding author. KK Research Centre, KK Women's and Children's Hospital, 100 Bukit Timah Road, Singapore 229899, Singapore. Tel.: +65 6394 3792; fax: +65 6394 1618. E-mail address: [email protected] (E.-C. Tan).

0378-1119/$ - see front matter © 2006 Elsevier B.V. All rights reserved. doi:10.1016/j.gene.2006.04.009

1. Introduction

Since the completion of the first draft of the human genome, the study of human genome sequence has shifted its focus to the identification and characterization of variations. However, most such studies are on substitutions rather than insertions/deletions, particularly single nucleotide polymorphisms (SNPs) involving only single base substitutions. As the most abundant class of polymorphisms, SNPs have the advantage of being available for any region of the DNA, including exons, introns, promoters or regulatory regions. Their ubiquitous presence throughout all chromosomes means they can be used as very high density markers for linkage studies. They are also easy and inexpensive to genotype. However, SNPs are usually biallelic, which means the polymorphism information content is limited as the maximum heterozygosity is only 0.5.

Insertion/deletion polymorphisms (indels) involve a difference in length between alleles. The difference can be anything from 1 base pair (bp) to hundreds or even more than a thousand. Indels can arise from random events during replication, when there is slippage during base-pairing in regions with repeating units, or from the very organized actions of transposable elements. Insertions and deletions are common variations in the genomes of many organisms. Based on size, they can be grouped into different classes: single nucleotide indels (SNindels), microsatellites, short interspersed repeat elements (SINEs), and long interspersed repeat elements (LINEs).

For studies on insertions/deletions, much of the earlier work has been on transposable elements like SINEs and LINEs. The largest group of SINEs, Alu insertion elements, is approximately 300 bp long and could number as many as 1 million per haploid genome (Rubin et al., 1980; Gu et al., 2000; International Human Genome Sequencing Consortium, 2001). They may be inherited through the germ line or arise through de novo transposition. LINEs are longer repeats of around 6 kb/unit. Although they are less abundant, there are still more than 800,000 copies and they might make up as much as 15–17% of the genome (Smit, 1996; Gu et al., 2000; Lander et al., 2001; Lutz et al., 2003).

For smaller indels, previous studies have concentrated on microsatellites, which are repeats of very short simple sequence with multiple alleles. The last decade saw their emergence as tools for linkage mapping of disease genes. They occur as tandem repeats of typically hundreds of units at each locus. The variable lengths of microsatellites are in a way due to insertion or deletion of multiple repeat units. They have been used for linkage analysis in genome scans for disease gene mapping and for individual identification due to their high frequency, heterozygosity and polymorphism information content. Microsatellites have also been used for assessment of genetic diversity and tracing of parental lineage due to their greater polymorphism information content and allelic heterogeneity compared to other types of polymorphisms. However, most of the studies were done with di- or tetranucleotide repeats due to greater ease of scoring and better characterization of their size range and allele frequencies.
Compared to single nucleotide substitutions and microsatellites, single nucleotide insertions/deletions (SNindels) are a neglected area in the study of genomic variation. To date, no study has systematically evaluated the distribution and density of SNindels in the human genome as well as among the different chromosomes. In this study, we determined the frequency of the 4 different SNindels and the variation in density between different human chromosomes. We also looked into the composition of the sequences immediately before and after the SNindel to investigate the relationship between observed variation and known mutation mechanisms, and to identify patterns which might shed light on context-sensitive mechanisms which affect biological processes such as replication and homologous recombination. Finally, we compare the pattern of distribution of the 4 types in the intronic, exonic and non-gene regions.

2. Materials and methods

2.1. SNindel data and data mining

Data was based on NCBI dbSNP build 120 (updated on March 18, 2004) with a total of 293,601 entries recorded. SNindels for the human genome were downloaded from http://ftp.ncbi.nlm.nih.gov/dbSNP in the zipped FASTA format data for rs record, subdivided into chromosomes. This format provides the flanking sequence for each report of variation in dbSNP, as well as all submitted sequences that have no variations. Variations which mapped to multiple chromosomes or did not map to any chromosome were excluded from subsequent analysis. Computer programs were written in C to obtain and analyze the SNP information from the downloaded data, which were further divided into the 4 types according to nucleotide identity. Chromosome length was obtained from National Center for Biotechnology Information, Build 33 of the Human Genome released on April 14, 2003.

For extraction of data on repeats and palindromes flanking each SNindel, sequences were extracted using a C program. A Perl script was used to extract the functional class of all SNindels from SNP flatfiles for each chromosome, which were obtained by batch query using SNP rs number from the NCBI flatfile database. The neighboring sequences were extracted from the FASTA format data using a C program. For identification of repeat sequences, the program was written to identify SNindels with nucleotides of the same base immediately adjacent to –/N. For direct palindromes, the nucleotide sequences before and after the SNindel were compared. To avoid double counting with the repeat sequences, the condition was set that the nucleotide next to the SNindel could not be of the same base. For reverse palindromes, the reverse complement of the sequence 10 nucleotides after the SNindel was compared with the sequence before the SNindel. The comparison method is the same as that used for direct palindromes.

2.2. Data analysis

Occurrence of all SNindels as recorded in the database was collected, and the number of polymorphisms and the identity of the base inserted/deleted at each position captured. The frequencies for each of the 4 nucleotides in the whole genome and on each chromosome were calculated.

Pearson's correlation matrix was used to investigate correlation between chromosome length and number of each of the 4 types of SNindels, and pairwise correlation between the numbers of the SNindels of different types. Two-sample t-tests were used to test that the mean numbers of –/A and –/T SNindels are equal, as well as for the mean numbers of –/C and –/G SNindels. A chi-square test was used to evaluate uniform distribution of the number of SNindels according to chromosome size. Difference in frequencies among the 4 types was also assessed by chi-square. All statistical analysis was performed using Minitab. Additional analyses were also performed on subsets of SNindels which were “validated” (as defined by information for that SNindel in the SNP flatfile), SNindels which were not within mononucleotide repeats, and SNindels which were within palindromic sequences.

3. Results

3.1. Frequency and distribution of SNindels among chromosomes

A total of 9,098,790 SNPs were recorded in the dbSNP database at that time, of which 3.2% or 293,601 were identified as SNindels. They come from all 22 autosomes and the 2 sex chromosomes. More than 20,000 are identified for each of the three largest chromosomes, and fewer than 5000 for each of the two smallest (Table 1A). Although the number of SNindels follows the normal distribution according to chromosome size (P > 0.15), correlation of chromosome size with number for each of the 4 types is not complete, with Pearson's correlation coefficients ranging from 0.879 for –/A to 0.796 for –/C. The total number of SNindels and the number for each of the 4 types according to base identity are presented in Fig. 1.

Table 1A. Distribution and frequencies of the 4 different types of SNindels for each of the chromosomes according to the data mined from dbSNP

| CHR | –/A | % | –/T | % | –/G | % | –/C | % | % –/G + –/C | Total |
| 1 | 8646 | 41.4 | 8447 | 40.4 | 1929 | 9.2 | 1870 | 9.0 | 18.2 | 20,892 |
| 2 | 9479 | 42.0 | 9201 | 40.7 | 1904 | 8.4 | 2007 | 8.9 | 17.3 | 22,591 |
| 3 | 11,418 | 42.0 | 11,172 | 41.1 | 2284 | 8.4 | 2340 | 8.6 | 17.0 | 27,214 |
| 4 | 7544 | 41.7 | 7570 | 41.8 | 1518 | 8.4 | 1459 | 8.1 | 16.5 | 18,091 |
| 5 | 7022 | 41.6 | 7037 | 41.7 | 1437 | 8.5 | 1394 | 8.3 | 16.8 | 16,890 |
| 6 | 7741 | 41.1 | 7694 | 40.9 | 1685 | 9.0 | 1706 | 9.1 | 18.0 | 18,826 |
| 7 | 6568 | 40.1 | 6744 | 41.1 | 1521 | 9.3 | 1565 | 9.5 | 18.8 | 16,398 |
| 8 | 5605 | 40.3 | 5714 | 41.1 | 1265 | 9.1 | 1321 | 9.5 | 18.6 | 13,905 |
| 9 | 4134 | 41.3 | 4050 | 40.4 | 952 | 9.5 | 885 | 8.8 | 18.3 | 10,021 |
| 10 | 6051 | 40.5 | 6063 | 40.6 | 1394 | 9.3 | 1424 | 9.5 | 18.9 | 14,932 |
| 11 | 5209 | 40.3 | 5189 | 40.1 | 1272 | 9.8 | 1265 | 9.8 | 19.6 | 12,935 |
| 12 | 5431 | 41.9 | 5300 | 40.9 | 1146 | 8.8 | 1074 | 8.3 | 17.1 | 12,951 |
| 13 | 4118 | 42.2 | 4041 | 41.4 | 826 | 8.5 | 776 | 8.0 | 16.4 | 9761 |
| 14 | 3702 | 40.8 | 3642 | 40.1 | 872 | 9.6 | 861 | 9.5 | 19.1 | 9077 |
| 15 | 3422 | 41.0 | 3373 | 40.4 | 823 | 9.8 | 738 | 8.8 | 18.7 | 8356 |
| 16 | 2882 | 38.3 | 3014 | 40.1 | 822 | 10.9 | 797 | 10.6 | 21.5 | 7515 |
| 17 | 3382 | 38.5 | 3464 | 39.5 | 981 | 11.2 | 947 | 10.8 | 22.0 | 8774 |
| 18 | 3165 | 40.5 | 3264 | 41.7 | 690 | 8.8 | 705 | 9.0 | 17.8 | 7824 |
| 19 | 2032 | 39.0 | 1925 | 37.0 | 629 | 12.1 | 623 | 12.0 | 24.0 | 5209 |
| 20 | 5598 | 38.6 | 5695 | 39.3 | 1623 | 11.2 | 1579 | 10.9 | 22.1 | 14,495 |
| 21 | 1726 | 38.8 | 1674 | 37.6 | 515 | 11.6 | 539 | 12.1 | 23.7 | 4454 |
| 22 | 1654 | 33.7 | 1765 | 36.0 | 762 | 15.5 | 728 | 14.8 | 30.4 | 4909 |
| X | 2990 | 41.1 | 2955 | 40.7 | 678 | 9.3 | 644 | 8.9 | 18.2 | 7267 |
| Y | 124 | 39.5 | 117 | 37.3 | 36 | 11.5 | 37 | 11.8 | 23.2 | 314 |
| Total | 119,643 | 40.8 | 119,110 | 40.6 | 27,564 | 9.4 | 27,284 | 9.3 | 18.7 | 293,601 |
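The size-versus-count correlation reported in Section 3.1 can be recomputed from the published counts. The following is a minimal Python sketch, not the authors' Minitab procedure. It assumes the chromosome sizes listed in Table 1B and the –/A counts in Table 1A, and it assumes that all 24 chromosomes enter the calculation, which the text does not state. The function name `pearson` and the list names are ours.

```python
from math import sqrt

# Chromosome sizes in bp (Table 1B) and -/A counts (Table 1A),
# in the order 1..22, X, Y. Transcribed from the tables, not re-derived.
SIZES = [
    282_193_664, 253_256_583, 227_524_578, 202_328_347, 203_085_532,
    182_415_242, 166_623_906, 152_776_421, 142_271_444, 145_589_288,
    150_783_553, 144_282_489, 119_744_898, 106_953_321, 101_380_521,
    104_298_331, 89_504_553, 86_677_548, 74_962_845, 66_668_005,
    44_907_570, 47_662_662, 162_599_930, 51_513_584,
]
A_COUNTS = [
    8646, 9479, 11418, 7544, 7022, 7741, 6568, 5605, 4134, 6051,
    5209, 5431, 4118, 3702, 3422, 2882, 3382, 3165, 2032, 5598,
    1726, 1654, 2990, 124,
]


def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


r_a = pearson(SIZES, A_COUNTS)
print(f"r(chromosome size, -/A count) = {r_a:.3f}")
```

Swapping in the –/T, –/G or –/C column gives the other coefficients. Whether the output agrees with the reported range (0.796 to 0.879) depends on the assumptions above, in particular on which chromosomes the authors included.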

Distribution of SNindels is not entirely uniform among the chromosomes. The largest number, 27,214, is recorded for chromosome 3, while chromosome Y has the smallest number with only 314. The smallest autosome, chromosome 21, has the fewest among autosomes at 4454. In terms of density, SNindels occur at a frequency of 0.887 per 10 kb on average for the whole genome, or approximately 1 for every 11,274 bp. This density varies for different chromosomes. The highest density is found on chromosome 20 at 2.174/10 kb, while chromosome Y has the lowest at 0.061/10 kb, a difference of 35 times. Chromosome 19 is the lowest among autosomes at 0.695/10 kb. Chromosome 3, which has the highest number of SNindels, has a density of 1.196/10 kb, and chromosome 21, which has the fewest among autosomes, is at 0.992/10 kb (Table 1B and Fig. 2).

Only 2324 (less than 1%) of the SNindels in the dbSNP database are validated. For validated SNindels, chromosome 6 has the largest number, followed by chromosomes 1, 2 and 5 (Table 2A). Density is 0.007/10 kb overall, with the highest at 0.014/10 kb for chromosome 22 and the lowest at 0.001/10 kb for chromosome Y. The autosome with the lowest density of validated SNindels is chromosome 13 at 0.004/10 kb (Table 2B).

Less than 36% of the SNindels in the dbSNP database (or 105,160) do not involve mononucleotide repeats. Eleven percent of these are found on chromosome 3. Chromosome 19 and chromosome Y have the lowest numbers of such SNindels (Table 3A). In terms of frequency, chromosome 20 has the highest density at 0.83/10 kb. The frequency for chromosome Y is 32× lower, while other autosomes have frequencies ranging between 0.23 and 0.53/10 kb (Table 3B).

Fig. 1. Summary of the total number of SNindels mapped to each chromosome, subdivided into the 4 different types.

Table 1B. Frequency of all SNindels for each chromosome

| CHR | Size | –/A | Freq (×10⁻⁴) | –/T | Freq (×10⁻⁴) | –/G | Freq (×10⁻⁴) | –/C | Freq (×10⁻⁴) | Total (×10⁻⁴) |
| 1 | 282,193,664 | 8646 | 3.064E−01 | 8447 | 2.993E−01 | 1929 | 6.836E−02 | 1870 | 6.627E−02 | 7.403E−01 |
| 2 | 253,256,583 | 9479 | 3.743E−01 | 9201 | 3.633E−01 | 1904 | 7.518E−02 | 2007 | 7.925E−02 | 8.920E−01 |
| 3 | 227,524,578 | 11,418 | 5.018E−01 | 11,172 | 4.910E−01 | 2284 | 1.004E−01 | 2340 | 1.028E−01 | 1.196E+00 |
| 4 | 202,328,347 | 7544 | 3.729E−01 | 7570 | 3.741E−01 | 1518 | 7.503E−02 | 1459 | 7.211E−02 | 8.941E−01 |
| 5 | 203,085,532 | 7022 | 3.458E−01 | 7037 | 3.465E−01 | 1437 | 7.076E−02 | 1394 | 6.864E−02 | 8.317E−01 |
| 6 | 182,415,242 | 7741 | 4.244E−01 | 7694 | 4.218E−01 | 1685 | 9.237E−02 | 1706 | 9.352E−02 | 1.032E+00 |
| 7 | 166,623,906 | 6568 | 3.942E−01 | 6744 | 4.047E−01 | 1521 | 9.128E−02 | 1565 | 9.392E−02 | 9.841E−01 |
| 8 | 152,776,421 | 5605 | 3.669E−01 | 5714 | 3.740E−01 | 1265 | 8.280E−02 | 1321 | 8.647E−02 | 9.102E−01 |
| 9 | 142,271,444 | 4134 | 2.906E−01 | 4050 | 2.847E−01 | 952 | 6.691E−02 | 885 | 6.221E−02 | 7.044E−01 |
| 10 | 145,589,288 | 6051 | 4.156E−01 | 6063 | 4.164E−01 | 1394 | 9.575E−02 | 1424 | 9.781E−02 | 1.026E+00 |
| 11 | 150,783,553 | 5209 | 3.455E−01 | 5189 | 3.441E−01 | 1272 | 8.436E−02 | 1265 | 8.390E−02 | 8.579E−01 |
| 12 | 144,282,489 | 5431 | 3.764E−01 | 5300 | 3.673E−01 | 1146 | 7.943E−02 | 1074 | 7.444E−02 | 8.976E−01 |
| 13 | 119,744,898 | 4118 | 3.439E−01 | 4041 | 3.375E−01 | 826 | 6.898E−02 | 776 | 6.480E−02 | 8.151E−01 |
| 14 | 106,953,321 | 3702 | 3.461E−01 | 3642 | 3.405E−01 | 872 | 8.153E−02 | 861 | 8.050E−02 | 8.487E−01 |
| 15 | 101,380,521 | 3422 | 3.375E−01 | 3373 | 3.327E−01 | 823 | 8.118E−02 | 738 | 7.280E−02 | 8.242E−01 |
| 16 | 104,298,331 | 2882 | 2.763E−01 | 3014 | 2.890E−01 | 822 | 7.881E−02 | 797 | 7.642E−02 | 7.205E−01 |
| 17 | 89,504,553 | 3382 | 3.779E−01 | 3464 | 3.870E−01 | 981 | 1.096E−01 | 947 | 1.058E−01 | 9.803E−01 |
| 18 | 86,677,548 | 3165 | 3.651E−01 | 3264 | 3.766E−01 | 690 | 7.961E−02 | 705 | 8.134E−02 | 9.027E−01 |
| 19 | 74,962,845 | 2032 | 2.711E−01 | 1925 | 2.568E−01 | 629 | 8.391E−02 | 623 | 8.311E−02 | 6.949E−01 |
| 20 | 66,668,005 | 5598 | 8.397E−01 | 5695 | 8.542E−01 | 1623 | 2.434E−01 | 1579 | 2.368E−01 | 2.174E+00 |
| 21 | 44,907,570 | 1726 | 3.843E−01 | 1674 | 3.728E−01 | 515 | 1.147E−01 | 539 | 1.200E−01 | 9.918E−01 |
| 22 | 47,662,662 | 1654 | 3.470E−01 | 1765 | 3.703E−01 | 762 | 1.599E−01 | 728 | 1.527E−01 | 1.030E+00 |
| X | 162,599,930 | 2990 | 1.839E−01 | 2955 | 1.817E−01 | 678 | 4.170E−02 | 644 | 3.961E−02 | 4.469E−01 |
| Y | 51,513,584 | 124 | 2.407E−02 | 117 | 2.271E−02 | 36 | 6.988E−03 | 37 | 7.183E−03 | 6.095E−02 |
| Total | 3,310,004,815 | 119,643 | 3.615E−01 | 119,110 | 3.598E−01 | 27,564 | 8.327E−02 | 27,284 | 8.243E−02 | 8.870E−01 |
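The densities quoted in Section 3.1 follow directly from the counts and chromosome sizes in Tables 1A and 1B. As an illustration only (the function names below are ours, not taken from the authors' C programs), a short Python sketch reproduces the genome-wide figure and the two extremes:

```python
def density_per_10kb(count, size_bp):
    """Number of SNindels per 10,000 bp of sequence."""
    return count / size_bp * 1e4


def bp_per_snindel(count, size_bp):
    """Average spacing between SNindels, in bp."""
    return size_bp / count


# (count, size in bp) pairs transcribed from Tables 1A and 1B.
GENOME = (293_601, 3_310_004_815)
CHR20 = (14_495, 66_668_005)
CHRY = (314, 51_513_584)

print(round(density_per_10kb(*GENOME), 3))  # 0.887
print(round(bp_per_snindel(*GENOME)))       # 11274
print(round(density_per_10kb(*CHR20), 3))   # 2.174
print(round(density_per_10kb(*CHRY), 3))    # 0.061
```

The same two functions applied to the validated and non-repeat counts should give the "Total" columns of Tables 2B and 3B, since those tables use the same chromosome sizes.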

Fig. 2. Comparison of densities of SNindels mapped to each chromosome, subdivided into the 4 different types.


Table 2A. Distribution and frequencies of the 4 different types of validated SNindels for each of the chromosomes according to the data mined from dbSNP

| CHR | –/A | % | –/T | % | –/G | % | –/C | % | % –/G + –/C | Total |
| 1 | 69 | 30.5 | 87 | 38.5 | 36 | 15.9 | 34 | 15.0 | 31.0 | 226 |
| 2 | 54 | 31.2 | 61 | 35.3 | 23 | 13.3 | 35 | 20.2 | 33.5 | 173 |
| 3 | 49 | 36.8 | 47 | 35.3 | 19 | 14.3 | 18 | 13.5 | 27.8 | 133 |
| 4 | 41 | 37.6 | 37 | 33.9 | 19 | 17.4 | 12 | 11.0 | 28.4 | 109 |
| 5 | 47 | 30.9 | 50 | 32.9 | 26 | 17.1 | 29 | 19.1 | 36.2 | 152 |
| 6 | 81 | 33.5 | 86 | 35.5 | 37 | 15.3 | 38 | 15.7 | 31.0 | 242 |
| 7 | 20 | 26.3 | 28 | 36.8 | 17 | 22.4 | 11 | 14.5 | 36.8 | 76 |
| 8 | 28 | 30.1 | 29 | 31.2 | 17 | 18.3 | 19 | 20.4 | 38.7 | 93 |
| 9 | 26 | 27.4 | 31 | 32.6 | 19 | 20.0 | 19 | 20.0 | 40.0 | 95 |
| 10 | 40 | 33.3 | 42 | 35.0 | 18 | 15.0 | 20 | 16.7 | 31.7 | 120 |
| 11 | 33 | 31.7 | 32 | 30.8 | 17 | 16.3 | 22 | 21.2 | 37.5 | 104 |
| 12 | 39 | 29.5 | 49 | 37.1 | 22 | 16.7 | 22 | 16.7 | 33.3 | 132 |
| 13 | 14 | 27.5 | 21 | 41.2 | 9 | 17.6 | 7 | 13.7 | 31.4 | 51 |
| 14 | 29 | 41.4 | 18 | 25.7 | 8 | 11.4 | 15 | 21.4 | 32.9 | 70 |
| 15 | 15 | 28.3 | 17 | 32.1 | 10 | 18.9 | 11 | 20.8 | 39.6 | 53 |
| 16 | 11 | 22.4 | 10 | 20.4 | 12 | 24.5 | 16 | 32.7 | 57.1 | 49 |
| 17 | 22 | 24.2 | 29 | 31.9 | 19 | 20.9 | 21 | 23.1 | 44.0 | 91 |
| 18 | 12 | 36.4 | 9 | 27.3 | 8 | 24.2 | 4 | 12.1 | 36.4 | 33 |
| 19 | 16 | 17.8 | 19 | 21.1 | 28 | 31.1 | 27 | 30.0 | 61.1 | 90 |
| 20 | 27 | 32.5 | 21 | 25.3 | 18 | 21.7 | 17 | 20.5 | 42.2 | 83 |
| 21 | 12 | 33.3 | 9 | 25.0 | 5 | 13.9 | 10 | 27.8 | 41.7 | 36 |
| 22 | 13 | 19.7 | 17 | 25.8 | 16 | 24.2 | 20 | 30.3 | 54.5 | 66 |
| X | 18 | 43.9 | 16 | 39.0 | 5 | 12.2 | 2 | 4.9 | 17.1 | 41 |
| Y | 1 | 16.7 | 2 | 33.3 | 3 | 50.0 | 0 | 0.0 | 50.0 | 6 |
| Total | 717 | 30.9 | 767 | 33.0 | 411 | 17.7 | 429 | 18.5 | 36.1 | 2324 |
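The percentage columns of Table 2A are each type's share of the row total. A small Python sketch (the function and variable names are ours, and the ASCII keys "-/A" etc. stand in for the paper's –/A notation) recomputes the "Total" row from its four counts:

```python
def type_shares(counts):
    """Percentage share of each SNindel type within one row of counts."""
    total = sum(counts.values())
    return {kind: 100 * n / total for kind, n in counts.items()}


# "Total" row of Table 2A (validated SNindels, all chromosomes).
VALIDATED_TOTAL = {"-/A": 717, "-/T": 767, "-/G": 411, "-/C": 429}

shares = type_shares(VALIDATED_TOTAL)
gc_share = shares["-/G"] + shares["-/C"]

for kind, pct in shares.items():
    print(f"{kind}: {pct:.1f}%")
print(f"-/G + -/C: {gc_share:.1f}%")  # 36.1%
```

Passing any other row of Table 1A, 2A or 3A as a dictionary gives that chromosome's percentages; small differences in the last digit are possible where the published values were rounded from different intermediate figures.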

3.2. Nucleotide bias and frequencies

The SNindels are not evenly distributed among the 4 different bases. The total numbers for –/A (119,643) and –/T (119,110) are more than 4 times the numbers for –/C (27,284) and –/G (27,564). As the strand from which the SNindels were identified was random and data was captured from both strands, there is likely to be double counting of the same site. Consequently the total number of –/A SNindels is expected to be close to that of –/T and the number of –/C close to that of –/G. This is borne out by the two-sample t-test, which shows no significant difference between the mean number of –/A and –/T SNindels (two-tailed P = 0.985) or between –/G and –/C SNindels (two-tailed P = 0.950).

In general, the proportion of –/C and –/G SNindels is about 18.7% for all chromosomes (Table 1A). Chromosome 22 differs in that its proportion of –/G and –/C is much higher than for other chromosomes at 30.4% (P = 0.0005). The chromosome with the lowest proportion of –/C and –/G SNindels is chromosome 13 at 16.4% (Fig. 3).

For validated SNindels, –/G and –/C make up 36%. There is no statistically significant difference between the total number of –/A SNindels and that of –/T, and the number of –/C SNindels is also close to that of –/G. –/G and –/C SNindels for individual chromosomes range from 17.1% for chromosome X and 31.0% for chromosome 1 to 61.1% for chromosome 19 (Table 2A).

For SNindels not within repeats, there is a more equitable distribution, with 42% being –/G and –/C SNindels (Table 3A). At either end of the spectrum are chromosome 3 with the lowest at only 32.8% and chromosome 22 with 59.6%; other chromosomes have values between 39.8% and 50.5%.

Fig. 3. Proportion of the 4 different types of SNindels for each of the chromosomes.

Table 2B. Frequency of validated SNindels for each chromosome

| CHR | Size | –/A | Freq (×10⁻⁴) | –/T | Freq (×10⁻⁴) | –/G | Freq (×10⁻⁴) | –/C | Freq (×10⁻⁴) | Total (×10⁻⁴) |
| 1 | 282,193,664 | 69 | 2.445E−03 | 87 | 3.083E−03 | 36 | 1.276E−03 | 34 | 1.205E−03 | 8.009E−03 |
| 2 | 253,256,583 | 54 | 2.132E−03 | 61 | 2.409E−03 | 23 | 9.082E−04 | 35 | 1.382E−03 | 6.831E−03 |
| 3 | 227,524,578 | 49 | 2.154E−03 | 47 | 2.066E−03 | 19 | 8.351E−04 | 18 | 7.911E−04 | 5.846E−03 |
| 4 | 202,328,347 | 41 | 2.026E−03 | 37 | 1.829E−03 | 19 | 9.391E−04 | 12 | 5.931E−04 | 5.387E−03 |
| 5 | 203,085,532 | 47 | 2.314E−03 | 50 | 2.462E−03 | 26 | 1.280E−03 | 29 | 1.428E−03 | 7.485E−03 |
| 6 | 182,415,242 | 81 | 4.440E−03 | 86 | 4.715E−03 | 37 | 2.028E−03 | 38 | 2.083E−03 | 1.327E−02 |
| 7 | 166,623,906 | 20 | 1.200E−03 | 28 | 1.680E−03 | 17 | 1.020E−03 | 11 | 6.602E−04 | 4.561E−03 |
| 8 | 152,776,421 | 28 | 1.833E−03 | 29 | 1.898E−03 | 17 | 1.113E−03 | 19 | 1.244E−03 | 6.087E−03 |
| 9 | 142,271,444 | 26 | 1.827E−03 | 31 | 2.179E−03 | 19 | 1.335E−03 | 19 | 1.335E−03 | 6.677E−03 |
| 10 | 145,589,288 | 40 | 2.747E−03 | 42 | 2.885E−03 | 18 | 1.236E−03 | 20 | 1.374E−03 | 8.242E−03 |
| 11 | 150,783,553 | 33 | 2.189E−03 | 32 | 2.122E−03 | 17 | 1.127E−03 | 22 | 1.459E−03 | 6.897E−03 |
| 12 | 144,282,489 | 39 | 2.703E−03 | 49 | 3.396E−03 | 22 | 1.525E−03 | 22 | 1.525E−03 | 9.149E−03 |
| 13 | 119,744,898 | 14 | 1.169E−03 | 21 | 1.754E−03 | 9 | 7.516E−04 | 7 | 5.846E−04 | 4.259E−03 |
| 14 | 106,953,321 | 29 | 2.711E−03 | 18 | 1.683E−03 | 8 | 7.480E−04 | 15 | 1.402E−03 | 6.545E−03 |
| 15 | 101,380,521 | 15 | 1.480E−03 | 17 | 1.677E−03 | 10 | 9.864E−04 | 11 | 1.085E−03 | 5.228E−03 |
| 16 | 104,298,331 | 11 | 1.055E−03 | 10 | 9.588E−04 | 12 | 1.151E−03 | 16 | 1.534E−03 | 4.698E−03 |
| 17 | 89,504,553 | 22 | 2.458E−03 | 29 | 3.240E−03 | 19 | 2.123E−03 | 21 | 2.346E−03 | 1.017E−02 |
| 18 | 86,677,548 | 12 | 1.384E−03 | 9 | 1.038E−03 | 8 | 9.230E−04 | 4 | 4.615E−04 | 3.807E−03 |
| 19 | 74,962,845 | 16 | 2.134E−03 | 19 | 2.535E−03 | 28 | 3.735E−03 | 27 | 3.602E−03 | 1.201E−02 |
| 20 | 66,668,005 | 27 | 4.050E−03 | 21 | 3.150E−03 | 18 | 2.700E−03 | 17 | 2.550E−03 | 1.245E−02 |
| 21 | 44,907,570 | 12 | 2.672E−03 | 9 | 2.004E−03 | 5 | 1.113E−03 | 10 | 2.227E−03 | 8.016E−03 |
| 22 | 47,662,662 | 13 | 2.728E−03 | 17 | 3.567E−03 | 16 | 3.357E−03 | 20 | 4.196E−03 | 1.385E−02 |
| X | 162,599,930 | 18 | 1.107E−03 | 16 | 9.840E−04 | 5 | 3.075E−04 | 2 | 1.230E−04 | 2.522E−03 |
| Y | 51,513,584 | 1 | 1.941E−04 | 2 | 3.882E−04 | 3 | 5.824E−04 | 0 | 0.000E+00 | 1.165E−03 |
| Total | 3,310,004,815 | 717 | 2.166E−03 | 767 | 2.317E−03 | 411 | 1.242E−03 | 429 | 1.296E−03 | 7.021E−03 |

Table 3A. Distribution and frequencies of the 4 different types of non-repeat SNindels for each of the chromosomes according to the data mined from dbSNP

| CHR | –/A | % | –/T | % | –/G | % | –/C | % | % –/G + –/C | Total |
| 1 | 2057 | 29.1 | 1985 | 28.0 | 1541 | 21.8 | 1495 | 21.1 | 42.9 | 7078 |
| 2 | 2238 | 29.9 | 2083 | 27.9 | 1541 | 20.6 | 1614 | 21.6 | 42.2 | 7476 |
| 3 | 4091 | 34.0 | 4000 | 33.2 | 1950 | 16.2 | 1991 | 16.5 | 32.8 | 12,032 |
| 4 | 1790 | 28.9 | 1864 | 30.1 | 1302 | 21.0 | 1244 | 20.1 | 41.1 | 6200 |
| 5 | 1720 | 29.5 | 1746 | 30.0 | 1187 | 20.4 | 1171 | 20.1 | 40.5 | 5824 |
| 6 | 1910 | 28.9 | 1920 | 29.0 | 1397 | 21.1 | 1386 | 21.0 | 42.1 | 6613 |
| 7 | 1679 | 28.3 | 1683 | 28.4 | 1276 | 21.5 | 1291 | 21.8 | 43.3 | 5929 |
| 8 | 1395 | 28.3 | 1413 | 28.7 | 1046 | 21.3 | 1068 | 21.7 | 43.0 | 4922 |
| 9 | 965 | 28.7 | 962 | 28.6 | 744 | 22.1 | 697 | 20.7 | 42.8 | 3368 |
| 10 | 1477 | 28.4 | 1431 | 27.6 | 1148 | 22.1 | 1138 | 21.9 | 44.0 | 5194 |
| 11 | 1306 | 28.1 | 1309 | 28.2 | 1013 | 21.8 | 1018 | 21.9 | 43.7 | 4646 |
| 12 | 1282 | 29.6 | 1263 | 29.2 | 910 | 21.0 | 870 | 20.1 | 41.2 | 4325 |
| 13 | 1009 | 30.5 | 983 | 29.7 | 679 | 20.5 | 640 | 19.3 | 39.8 | 3311 |
| 14 | 860 | 27.9 | 845 | 27.4 | 703 | 22.8 | 678 | 22.0 | 44.8 | 3086 |
| 15 | 847 | 29.6 | 800 | 27.9 | 644 | 22.5 | 573 | 20.0 | 42.5 | 2864 |
| 16 | 649 | 25.3 | 685 | 26.7 | 631 | 24.6 | 604 | 23.5 | 48.1 | 2569 |
| 17 | 771 | 25.6 | 769 | 25.6 | 756 | 25.1 | 712 | 23.7 | 48.8 | 3008 |
| 18 | 737 | 27.3 | 804 | 29.8 | 582 | 21.6 | 572 | 21.2 | 42.8 | 2695 |
| 19 | 470 | 26.9 | 397 | 22.7 | 434 | 24.8 | 449 | 25.7 | 50.5 | 1750 |
| 20 | 1490 | 26.9 | 1447 | 26.1 | 1348 | 24.3 | 1255 | 22.7 | 47.0 | 5540 |
| 21 | 582 | 28.3 | 570 | 27.7 | 450 | 21.9 | 456 | 22.2 | 44.0 | 2058 |
| 22 | 386 | 20.0 | 395 | 20.4 | 583 | 30.1 | 570 | 29.5 | 59.6 | 1934 |
| X | 735 | 28.2 | 743 | 28.5 | 592 | 22.7 | 533 | 20.5 | 43.2 | 2603 |
| Y | 38 | 28.1 | 42 | 31.1 | 28 | 20.7 | 27 | 20.0 | 40.7 | 135 |
| Total | 30,484 | 29.0 | 30,139 | 28.7 | 22,485 | 21.4 | 22,052 | 21.0 | 42.4 | 105,160 |

Table 3B. Frequency of SNindels in non-repeat regions for each chromosome

| CHR | Size | –/A | Freq (×10⁻⁴) | –/T | Freq (×10⁻⁴) | –/G | Freq (×10⁻⁴) | –/C | Freq (×10⁻⁴) | Total (×10⁻⁴) |
| 1 | 282,193,664 | 2057 | 7.289E−02 | 1985 | 7.034E−02 | 1541 | 5.461E−02 | 1495 | 5.298E−02 | 2.508E−01 |
| 2 | 253,256,583 | 2238 | 8.837E−02 | 2083 | 8.225E−02 | 1541 | 6.085E−02 | 1614 | 6.373E−02 | 2.952E−01 |
| 3 | 227,524,578 | 4091 | 1.798E−01 | 4000 | 1.758E−01 | 1950 | 8.571E−02 | 1991 | 8.751E−02 | 5.288E−01 |
| 4 | 202,328,347 | 1790 | 8.847E−02 | 1864 | 9.213E−02 | 1302 | 6.435E−02 | 1244 | 6.148E−02 | 3.064E−01 |
| 5 | 203,085,532 | 1720 | 8.469E−02 | 1746 | 8.597E−02 | 1187 | 5.845E−02 | 1171 | 5.766E−02 | 2.868E−01 |
| 6 | 182,415,242 | 1910 | 1.047E−01 | 1920 | 1.053E−01 | 1397 | 7.658E−02 | 1386 | 7.598E−02 | 3.625E−01 |
| 7 | 166,623,906 | 1679 | 1.008E−01 | 1683 | 1.010E−01 | 1276 | 7.658E−02 | 1291 | 7.748E−02 | 3.558E−01 |
| 8 | 152,776,421 | 1395 | 9.131E−02 | 1413 | 9.249E−02 | 1046 | 6.847E−02 | 1068 | 6.991E−02 | 3.222E−01 |
| 9 | 142,271,444 | 965 | 6.783E−02 | 962 | 6.762E−02 | 744 | 5.229E−02 | 697 | 4.899E−02 | 2.367E−01 |
| 10 | 145,589,288 | 1477 | 1.014E−01 | 1431 | 9.829E−02 | 1148 | 7.885E−02 | 1138 | 7.817E−02 | 3.568E−01 |
| 11 | 150,783,553 | 1306 | 8.661E−02 | 1309 | 8.681E−02 | 1013 | 6.718E−02 | 1018 | 6.751E−02 | 3.081E−01 |
| 12 | 144,282,489 | 1282 | 8.885E−02 | 1263 | 8.754E−02 | 910 | 6.307E−02 | 870 | 6.030E−02 | 2.998E−01 |
| 13 | 119,744,898 | 1009 | 8.426E−02 | 983 | 8.209E−02 | 679 | 5.670E−02 | 640 | 5.345E−02 | 2.765E−01 |
| 14 | 106,953,321 | 860 | 8.041E−02 | 845 | 7.901E−02 | 703 | 6.573E−02 | 678 | 6.339E−02 | 2.885E−01 |
| 15 | 101,380,521 | 847 | 8.355E−02 | 800 | 7.891E−02 | 644 | 6.352E−02 | 573 | 5.652E−02 | 2.825E−01 |
| 16 | 104,298,331 | 649 | 6.223E−02 | 685 | 6.568E−02 | 631 | 6.050E−02 | 604 | 5.791E−02 | 2.463E−01 |
| 17 | 89,504,553 | 771 | 8.614E−02 | 769 | 8.592E−02 | 756 | 8.446E−02 | 712 | 7.955E−02 | 3.361E−01 |
| 18 | 86,677,548 | 737 | 8.503E−02 | 804 | 9.276E−02 | 582 | 6.715E−02 | 572 | 6.599E−02 | 3.109E−01 |
| 19 | 74,962,845 | 470 | 6.270E−02 | 397 | 5.296E−02 | 434 | 5.790E−02 | 449 | 5.990E−02 | 2.334E−01 |
| 20 | 66,668,005 | 1490 | 2.235E−01 | 1447 | 2.170E−01 | 1348 | 2.022E−01 | 1255 | 1.882E−01 | 8.310E−01 |
| 21 | 44,907,570 | 582 | 1.296E−01 | 570 | 1.269E−01 | 450 | 1.002E−01 | 456 | 1.015E−01 | 4.583E−01 |
| 22 | 47,662,662 | 386 | 8.099E−02 | 395 | 8.287E−02 | 583 | 1.223E−01 | 570 | 1.196E−01 | 4.058E−01 |
| X | 162,599,930 | 735 | 4.520E−02 | 743 | 4.569E−02 | 592 | 3.641E−02 | 533 | 3.278E−02 | 1.601E−01 |
| Y | 51,513,584 | 38 | 7.377E−03 | 42 | 8.153E−03 | 28 | 5.435E−03 | 27 | 5.241E−03 | 2.621E−02 |
| Total | 3,310,004,815 | 30,484 | 9.210E−02 | 30,139 | 9.105E−02 | 22,485 | 6.793E−02 | 22,052 | 6.662E−02 | 3.177E−01 |

3.3. Regional/Functional bias

The information on the functional class of all SNindels obtained from the dbSNP flatfile is summarized in Table 4. More than half of the identified SNindels are in regions classified as “Locus_unknown”, which do not harbor any known genes. Among the 124,390 which are within known genes, only 537 (0.43%) are in translated regions, with another 18,936 (15.2%) in untranslated regions of mRNAs, while the remaining 104,917 (84.3%) are located in introns. This pattern holds true across all chromosomes. In terms of distribution into the different functional classes for those mapped to gene regions, more than 1.4% of –/C and –/G SNindels are in coding regions compared to 0.2% for –/A and –/T SNindels (χ2 = 655.21, P < 0.000).

Comparing SNindels which map within or near known genes (functional classes “Gene region” and “Locus_region”) and those which map to regions with no known genes (functional class “Unknown”), there is a statistically significant difference in the distribution, with the frequencies of –/A and –/T SNindels higher in gene and locus regions compared to regions with no known genes (χ2 = 315.42, P < 0.00).

The relative frequencies of the 4 types of SNindels within each functional class are presented in Table 5. For introns, the proportion of –/C and –/G SNindels is higher for those within splice sites compared to those located within non-splice sites (χ2 = 8.077, P < 0.044). For exons, there is also a statistically significant difference in the distribution between translated and untranslated regions (χ2 = 389.15, P < 0.00). Almost 80% of SNindels within untranslated regions are of types –/A and –/T, whereas they account for less than 40% of coding SNindels.

For validated SNindels, about 80% are found within the gene region, another 10% near genes (within 2 kb of a known gene) and only 10% in regions not near any identified genes. For SNindels within coding regions, the proportion of –/G and –/C is 62% even though they only make up 19% of SNindels in gene regions.

For SNindels not within repeats, only about 42% are found in gene regions. This proportion is similar for all 4 types of SNindels. For those within introns, –/A and –/T SNindels make up over 60%, while the distribution is more equitable for those within exons. The different functional regions had similar distribution into the 4 types of SNindels.

Table 4. Distribution of different functional classes for each type of SNindel

| Base identity | Intron: splice site | Intron: others | Exon: UTR c | Exon: coding | Total in gene region | Locus_region a | Locus_unknown b |
| –/A | 13 (0.01%) | 42,583 (35.71%) | 7642 (6.41%) | 98 (0.08%) | 50,336 (42.22%) | 4532 (3.80%) | 64,775 (54.33%) |
| –/T | 16 (0.01%) | 42,587 (35.87%) | 7494 (6.31%) | 103 (0.09%) | 50,200 (42.28%) | 4258 (3.59%) | 64,652 (54.46%) |
| –/G | 6 (0.02%) | 9906 (36.13%) | 1877 (6.85%) | 164 (0.60%) | 11,953 (43.59%) | 1416 (5.16%) | 14,195 (51.77%) |
| –/C | 9 (0.03%) | 9797 (37.47%) | 1923 (7.36%) | 172 (0.66%) | 11,901 (45.52%) | 1489 (5.70%) | 13,894 (53.15%) |
| All | 44 (0.02%) | 104,873 (35.85%) | 18,936 (6.47%) | 537 (0.18%) | 124,390 (42.52%) | 11,695 (4.00%) | 157,516 (53.85%) |

Percentages are calculated within each SNindel type for division into the different functional classes.
a Locus_region, variation is within 2 kb 5′ or 500 bp 3′ of a gene feature (on either strand), but the variation is not in the transcript for the gene.
b Locus_unknown, the locus link information is not available in the flatfile format, thus not mapped to any gene.
c MRNA-UTR, the variation is in the transcript of a gene but not in the coding region of the transcript.

Table 5. Types of SNindel for each of the functional classes as defined by dbSNP

| Functional class | –/A | –/T | –/G | –/C | Total |
| Gene region, intron: splice site | 13 (29.55%) | 16 (36.36%) | 6 (13.64%) | 9 (20.45%) | 44 (100%) |
| Gene region, intron: others | 42,583 (40.60%) | 42,587 (40.61%) | 9906 (9.45%) | 9797 (9.34%) | 104,873 (100%) |
| Gene region, exon: UTR | 7642 (40.36%) | 7494 (39.58%) | 1877 (9.91%) | 1923 (10.16%) | 18,936 (100%) |
| Gene region, exon: coding | 98 (18.25%) | 103 (19.18%) | 164 (30.54%) | 172 (32.03%) | 537 (100%) |
| Total in gene region | 50,336 (40.47%) | 50,200 (40.36%) | 11,953 (9.61%) | 11,901 (9.57%) | 124,390 (100%) |
| Locus_region (within 2 kb 5′ or 500 bp 3′ of a gene feature) | 4532 (38.75%) | 4258 (36.41%) | 1416 (12.11%) | 1489 (12.73%) | 11,695 (100%) |
| Loc_unknown (not mapped to any gene) | 64,775 (41.12%) | 64,652 (41.04%) | 14,195 (9.01%) | 13,894 (8.82%) | 157,516 (100%) |
| All | 119,643 (40.75%) | 119,110 (40.57%) | 27,564 (9.39%) | 27,284 (9.29%) | 293,601 (100%) |

Percentages are calculated within each functional group for division into the 4 SNindel types.
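The coding versus non-coding comparison in Section 3.3 can be illustrated with a plain Pearson chi-square on a 2×2 table. The sketch below is an assumption about how the test was set up, since the text does not spell it out: it pools –/A with –/T and –/G with –/C, restricts the comparison to SNindels in gene regions, and applies no continuity correction. The counts come from Table 4; the function name `chi_square_2x2` is ours, and the authors used Minitab rather than this code.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    observed = ((a, b), (c, d))
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2


# Counts from Table 4, pooled by base pair type.
coding_at = 98 + 103             # -/A and -/T in coding regions
coding_gc = 164 + 172            # -/G and -/C in coding regions
gene_at = 50_336 + 50_200        # all -/A and -/T in gene regions
gene_gc = 11_953 + 11_901        # all -/G and -/C in gene regions

chi2 = chi_square_2x2(
    coding_at, coding_gc,                       # row 1: coding
    gene_at - coding_at, gene_gc - coding_gc,   # row 2: rest of gene region
)
print(f"chi-square = {chi2:.2f}")
print(f"coding share, -/G and -/C: {100 * coding_gc / gene_gc:.2f}%")
print(f"coding share, -/A and -/T: {100 * coding_at / gene_at:.2f}%")
```

Under these assumptions the statistic comes out at about 655.2, in line with the reported χ2 = 655.21, and the two coding shares are about 1.4% and 0.2%. The other chi-square values in this section would need the corresponding contingency tables, which may have more than two columns.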

3.4. Neighboring sequence

For all 4 types, bases of the same identity as the SNindel are the most common within 10 bp before and after the SNindel position. For types –/A and –/T, they are as high as 60% for some chromosomes (Fig. 4). For types –/C and –/G, the proportion of nucleotides of the same identity is lower at less than 30%, but still higher than that of any of the other three non-identical bases. Between the 22 autosomes and the 2 sex chromosomes, there is no difference in the neighboring base composition (Table 6).

For flanking nucleotides which are different in identity from the SNindel, our analysis showed that A and T occur more frequently than G and C both before and after the SNindel position (Fig. 4).

Fig. 4. Base composition of nucleotides 10 positions before and 10 positions after the SNindel for the autosomes (first 4 columns are the composition before –/N and the next 4 the breakdown after –/N) and sex chromosomes (first 4 columns are the composition before –/N and the next 4 the breakdown after –/N).

3.5. Occurrence within repeats

It would be expected that SNindels would occur at higher frequency within repeats, especially at positions where there is a repeat of the same base. Our results show that the frequency of repeats of the same base around the SNindel site is indeed high. About 94.5% of those within repeats are of the –/A and –/T types (Table 7), of which about 58% are between 8 and 15 units of the same base. For –/G and –/C SNindels, shorter repeats of less than 8 are most frequent, and only one quarter of the repeats involving the same base as the SNindel are between 8 and 15 units (Fig. 5). Besides the higher frequency of longer repeats for –/A and –/T, their longest repeats are also longer compared to –/G and –/C SNindels. The longest repeat found was 47 for –/A, 43 for –/T, 18 for –/G and 18 for –/C SNindels. For sex chromosomes, the longest repeats are 40 for –/A, 35 for –/T, 13 for –/G, and 12 for –/C. All are much shorter than those for autosomes, with the length difference of the longest repeat at 8 nucleotides for –/T and 5 nucleotides for –/G.

3.6. Palindromes

We define palindromes as a symmetrical sequence on both sides of the SNindel (e.g. ATGC N CGTA) or a sequence which is identical to the complementary strand on either side of the SNindel (e.g. ATGC N GCAT). For direct palindromes, about 2000 were found for each of the 4 SNindels (Table 7). Most of them are 4–7 nucleotides in length, with –/G SNindels having a higher percentage of the longer direct palindromes. Overall there are 5 times more reverse palindromes. However, the increase is all from those surrounding the –/A and –/T types, as the numbers for the –/G and –/C SNindels are similar to the direct palindromes (Table 6). Across all 4 types, the average length of reverse palindromes is much shorter than for direct palindromes (Fig. 6A and B).

4. Discussion

Table 6. Breakdown of sequence composition for 10 bases before and 10 bases after the SNindel position

Mean composition     Before                          After
                     %A     %T     %G     %C         %A     %T     %G     %C
Autosomes
  –/A                58.3   18.6   10.4   12.8       59.2   17.6   12.7   10.5
  –/T                17.2   60.1   10.2   12.5       18.9   57.4   13.1   10.6
  –/G                27.4   25.2   29.8   17.5       26.8   25.8   27.8   19.6
  –/C                25.2   26.5   19.0   29.4       25.3   28.0   18.1   28.6
Sex chromosomes
  –/A                59.0   18.4   10.2   12.5       60.0   17.2   12.3   10.4
  –/T                17.3   60.3   10.2   12.3       18.6   57.8   13.1   10.6
  –/G                29.8   24.3   28.2   17.7       27.3   25.3   28.2   19.3
  –/C                23.9   27.2   20.5   28.4       25.2   28.3   16.9   29.6
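The kind of tabulation summarized in Table 6 can be sketched as follows: for one SNindel, count the bases in the 10 positions before and the 10 positions after the variable site. This is only an illustration under stated assumptions. The function name `flank_composition` is hypothetical, the sequence is assumed to contain the variable base at a 0-based index `indel_pos`, and the authors' actual pipeline is not described at this level of detail.

```python
from collections import Counter


def flank_composition(sequence, indel_pos, width=10):
    """Percent A/T/G/C in the `width` bases before and after a 0-based SNindel position."""
    # Assumption: the variable base itself sits at indel_pos and is excluded from both windows.
    before = sequence[max(0, indel_pos - width):indel_pos].upper()
    after = sequence[indel_pos + 1:indel_pos + 1 + width].upper()

    def percentages(window):
        counts = Counter(window)
        total = sum(counts[base] for base in "ATGC")  # ignore N or other symbols
        return {base: (100.0 * counts[base] / total if total else 0.0) for base in "ATGC"}

    return percentages(before), percentages(after)


# Toy sequence (not real data): 10 bases, one variable A, then 10 bases.
toy = "AAAAATTTTT" + "A" + "GGGGGCCCCC"
before_pct, after_pct = flank_composition(toy, indel_pos=10)
print(before_pct)
print(after_pct)
```

Averaging these per-site percentages over all SNindels of one type on a set of chromosomes would give one row of a table like Table 6.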

4.1. Data validation and bias

In this study, our analysis used only data from the dbSNP database, which identifies variations through in silico comparison of DNA sequences submitted by different sources. As the existence of the identified SNindels was not validated experimentally, many of them may well turn out to be sequencing errors and not true SNindel polymorphisms. As an indication of the true existence of such variations identified through in silico alignment and catalogued in the database, only 80% of the single nucleotide polymorphisms found by the SNP Consortium were reported to be present (Marth et al., 2001). In an analysis of 2000 diallelic indels, only 14% of those with 1-bp difference were confirmed to be present (Weber et al., 2002). Hence the total number of SNindels reported in this study is likely to be an overestimate. However, if the higher count is due to random sequencing errors, the comparison between different chromosomes, nucleotides or functional categories would not be changed as invalid SNindels will only change the total number and density estimates.

Another caveat is that the dbSNP database might also be biased in favor of gene/unique regions and thus its content might not be representative of the whole genome. This is because gene regions are more likely to have more sequence submissions than non-gene regions, resulting in biased sequence availability. Hence it is likely that SNindel numbers for non-gene regions might actually be higher as these regions are likely to be undersequenced and thus have fewer overlapping contigs for comparison. If SNindels were equally distributed in gene and non-gene regions, then we expect the numbers in non-gene regions to be higher than the 53.6% found in this study, since they are thought to constitute more than 80% of the genome.

Table 7. Summary of total number of each of the 4 different types of SNindels distributed into the different repeat and palindrome size classes

Length of the            Number                                  Percentage
sequence                 –/A      –/T      –/G     –/C           –/A      –/T      –/G      –/C
Repeats
  4–7                    13,604   13,604   3837    3888          15.29    15.26    75.55    74.31
  8–15                   52,184   52,282   1234    1338          58.65    58.64    24.30    25.57
  15–20                  16,108   16,031   8       6             18.10    17.98    0.16     0.11
  20+                    7075     7242     –       –             7.95     8.12     –        –
  Total                  88,971   89,159   5079    5232          100.00   100.00   100.00   100.00
Direct palindromes
  4.6                    1451     1421     1297    1299          73.65    70.98    59.77    62.09
  8+                     519      581      873     793           26.35    29.02    40.23    37.91
  Total                  1970     2002     2170    2092          100.00   100.00   100.00   100.00
Reverse palindromes
  4.6                    10,219   10,061   1653    1554          91.04    90.12    87.97    87.35
  8+                     1006     1103     226     225           8.96     9.88     12.03    12.65
  Total                  11,225   11,164   1879    1779          100.00   100.00   100.00   100.00
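The palindrome classes counted in Table 7 can be illustrated with a short sketch based on the two examples given in Section 3.6. It assumes that "direct" refers to the mirror-symmetric case (ATGC N CGTA) and "reverse" to the reverse-complement case (ATGC N GCAT). The function name and the way flanks are passed in are hypothetical, not the authors' implementation.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def palindrome_arm_lengths(left_flank, right_flank):
    """Return (direct, reverse) palindrome arm lengths around a SNindel.

    direct:  number of positions, walking outward from the SNindel, where the
             left and right flanks carry the same base (ATGC N CGTA).
    reverse: number of positions where the right flank carries the complement
             of the left flank base (ATGC N GCAT).
    """
    left = left_flank.upper()
    right = right_flank.upper()

    direct = 0
    for left_base, right_base in zip(reversed(left), right):
        if left_base != right_base:
            break
        direct += 1

    reverse = 0
    for left_base, right_base in zip(reversed(left), right):
        if COMPLEMENT.get(left_base) != right_base:
            break
        reverse += 1

    return direct, reverse


print(palindrome_arm_lengths("ATGC", "CGTA"))  # mirror-symmetric example from the text
print(palindrome_arm_lengths("ATGC", "GCAT"))  # reverse-complement example from the text
```

Under this reading, a site would be counted in Table 7 only when the relevant arm length reaches the smallest size class (4). Note that a SNindel inside a homopolymer run also satisfies the mirror-symmetric test, so how such overlaps were handled in the original analysis is not something this sketch can settle.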

Until all the SNindels in the database have been experimentally validated, we do not know whether there is indeed a bias in coding regions, although it is reported that dinucleotide repeats are essentially constant in both coding and non-coding regions (Russell et al., 1976). It should also be noted that there is definitely some double counting, as data from both strands are captured and it is not possible to select from only one strand. Nevertheless, our analyses can serve as an indication of the variation for indels with a length difference of one nucleotide.

4.2. SNindel content and distribution

The GC content of the human genome is approximately 40.9% (Zhao et al., 2003), but the proportion of –/C and –/G SNindels is only 18.7%. There is overrepresentation of –/A and –/T SNindels, in contrast to substitutions, which have a closer correlation to genome GC content (Marth et al., 2003). For validated SNindels it is less skewed but the difference still exists. This might reflect the stricter process for checking and control of indels involving G or C, which is probably different from substitutions, which arise mostly from damage and inaccurate repair after deamination/methylation rather than slippage during replication. A study on mononucleotide repeats on human chromosome 22 reported that polyA/T tracts are more abundant than polyC/G tracts by 100 times in terms of length (Toth et al., 2000). In this study we found that SNindels occurring in polyA/T regions are 20 times more common than those in polyC/G (Table 5). Another study investigating the distribution of simple sequence repeats in eukaryotic genomes reported that mononucleotide repeats are predominantly polyA/polyT, and polyC/polyG repeats are rare (Katti et al., 2001). Hence, in addition to the higher AT content leading to a higher number of –/A and –/T SNindels, the number of polyA and polyT tracts further skews the SNindel distribution towards –/A and –/T if slippage is indeed the main mechanism for the generation of SNindels. Thus the low proportion of –/C and –/G is another reflection of the low frequencies of polyC/polyG in the genome. Short sequence repeats of other lengths also follow this pattern of AT bias, with the most common dinucleotide repeats being of the AC and AT types, and the most common trinucleotide repeat being AAT (International Human Genome Sequencing Consortium, 2001; Astolfi et al., 2003). On the other hand, the higher proportion of –/G and –/C SNindels in coding regions is probably a reflection of the higher GC content of coding compared to non-coding/non-gene regions.

In terms of density, the total counts follow chromosome size, suggesting that the process by which SNindels are generated is probably random. Consistent with that is the observation that chromosomes 19 and 22, which have the two highest proportions of –/C and –/G SNindels, are also the two chromosomes with the highest GC content. At the other end of the spectrum, chromosome 4, which has the lowest GC content, has the lowest proportion of –/C and –/G SNindels (Zhao et al., 2003). Another study comparing chromosomes 21 and 22 showed that chromosome 22 has significantly more mononucleotide repeats, probably accounting for the higher number and density for this chromosome (Katti et al., 2001).

Fig. 5. Distribution of different lengths of repeats of the same base around the 4 types of SNindel (length of repeats from 4 to 20+).

Fig. 6. A. Distribution of different lengths of direct palindromes around the 4 types of SNindel. B. Distribution of different lengths of reverse palindromes around the 4 types of SNindel.

4.3. Implication on genome stability and evolution

Sequence context at SNindel polymorphic sites could throw light on mechanisms of generation of such SNindels and effect of neighboring sequence environment. For all 4 types of SNindel, the most abundant nucleotide for 10 positions before and after the SNindel is that of the same base (Fig. 4). Sixty-four percent of SNindels are in regions where there is a repeat of the same base, suggesting unequal recombination, duplicative insertion or slippage-like events as the mechanism. This is consistent with the hypothesis that SNindels arose through slippage and mispairing in regions with multiple units of the same nucleotide during replication or unequal recombination. Proofreading and repair might also be less efficient in regions with stretches of polyN, hence the genome might be less stable in regions with repeats of the same base. In contrast, CpGs are more common among neighboring nucleotide patterns in substitution polymorphisms (Zhao and Boerwinkle, 2002). Overall, 3% of the human genome is made up of tandem repeats which are prone to the generation of indels. Both the abundant Alu repeats and low-copy repeats are thought to predispose the genome to instability either through homologous or non-allelic homologous recombination and cause genomic disorders in extreme cases (Stenger et al., 2001; Stankiewicz and Lupski, 2002). Small indels in gene regions also have the potential to cause more lethal changes compared to substitution mutations. When genome sequences for other primates become available, it will be clearer whether SNindels arise more through insertions or deletions. In any case, SNindels are not likely to contribute significantly to variations in genome size, as that difference is thought to arise primarily through large deletions (Ophir and Graur, 1997; Mouse Genome Sequencing Consortium, 2002).
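Since the slippage argument rests on SNindels lying inside runs of the same base, a minimal sketch of how such a run could be measured and binned may help. The function names are hypothetical, the variable base is assumed to be present in the sequence at a 0-based index, and the size classes follow Table 7. Because the table's 8–15 and 15–20 classes share the value 15, assigning a run of 15 to the lower class is an assumption.

```python
def homopolymer_run_length(sequence, pos):
    """Return (base, length) of the run of identical bases that contains index `pos`."""
    seq = sequence.upper()
    base = seq[pos]
    left = pos
    while left > 0 and seq[left - 1] == base:
        left -= 1
    right = pos
    while right < len(seq) - 1 and seq[right + 1] == base:
        right += 1
    return base, right - left + 1


def repeat_size_class(length):
    """Map a run length to a Table 7 size class; runs shorter than 4 are not counted as repeats."""
    if length < 4:
        return None
    if length <= 7:
        return "4-7"
    if length <= 15:  # assumption: a run of exactly 15 goes to the lower class
        return "8-15"
    if length <= 20:
        return "15-20"
    return "20+"


# Toy sequence (not real data): a run of eight A bases flanked by other bases.
base, length = homopolymer_run_length("GCAAAAAAAAGT", pos=5)
print(base, length, repeat_size_class(length))
```

Applied to every SNindel of a given type, the resulting class counts would correspond to one column of the repeat rows in Table 7.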


4.4. Functional implications

Indels constitute ∼0.7% of all known mutations involved in human genetic diseases (Chuzhanova et al., 2003). Insertions of Alu repeats into genes have been known to cause diseases such as neurofibromatosis type-1, Apert syndrome and various types of cancers (Wallace et al., 1991; Miki et al., 1996; Deininger and Batzer, 1999; Oldridge et al., 1999). L1 transposition has been reported to be responsible for 14 disease-producing transpositions in humans (reviewed in Ostertag and Kazazian, 2001). Small insertions/deletions are also responsible for a number of inherited diseases through the alteration in the number of repeat units. Although there are pathogenic expansions involving repeats of tetramer and pentamer units, the most common type of pathogenic indel is the trinucleotide repeat, which is involved in at least 17 diseases (Tan and Lai, 2005). A recent paper investigating pseudogenes found that frequencies of 3-bp deletions are higher than those for 2- and 4-bp deletions, and the bias is independent of GC content (Zhang and Gerstein, 2003). However, this is likely to be a reflection of the higher tolerance for 3-bp deletions within coding regions, as they do not result in frameshift mutations or lead to premature termination. Similarly, the frequency of SNindels is extremely low in coding regions, as the effect would be more serious in most cases. Indeed, single base deletions are involved in some of the most common Mendelian diseases such as cystic fibrosis and hemophilia (Zielenski and Tsui, 1995; Economou et al., 1992).

Nuclear proteins which are involved in recombination, replication, endonuclease and transcription activities typically recognize and bind to 10–20 bp motifs in genomic DNA. As many of these regulatory proteins are dimers or tetramers, their recognition sequences tend to be palindromic as well. Our analysis showed that 11.67% of SNindels are surrounded by palindromic sequences: 2.8% are in direct palindromes and 8.87% are in reverse palindromes, which are found mostly around –/A and –/T SNindels. Chromosomal regions containing complex genomic architecture such as AT-rich palindromes are susceptible to double-stranded breaks which might be involved in rearrangement-based disorders (Shaw and Lupski, 2004). Long palindromic sequences have also been observed at many chromosomal breakpoints and in malignancy (Lewis et al., 2005). This analysis only studied short palindromes, which may be the precursors of such pathogenic palindromes, which are highly polymorphic and unstable (Rattray, 2005).

In parallel with the difference in single nucleotide substitution rate between coding and non-coding regions, SNindels are underrepresented in coding regions. In addition, there is a correlation with GC content of individual chromosomes. For all SNindels and those not within mononucleotide repeats, the lowest density is found on the most GC-rich chromosome, chromosome 19 with GC content of 48.33%, and the highest number on chromosome 3, one of the lowest in GC content at 39.86%. This confirms that the generation of such SNindels is largely random.

4.5. Conclusions

Patterns of SNindels are of particular interest as they may shed light on important questions such as genome stability, tolerance for single nucleotide gaps, replication fidelity, efficiency of repair mechanisms, and molecular mechanisms of genome evolution. Our results provide a snapshot of the estimated in silico SNindel density across the different chromosomes and the 4 different nucleotide categories. The high correlation between chromosome size and frequencies indicates that the process is largely random. For SNindels which are within gene regions, their distribution mostly within intronic regions is evidence that there are strict control mechanisms against exonic mutations. That SNindels occur mostly within repeats of the same nucleotide supports slippage as one mechanism which generates such variations. For dinucleotide repeats, there is evidence that every genome has its own characteristics in terms of relative abundance regardless of genome size and GC content (Karlin and Ladunga, 1994; Gentles and Karlin, 2001). As SNindel data for other genomes are not available, it is not known if there is also a unique signature and constancy for each genome. In addition, as each genome has its own distribution of isochores, whether the distribution of SNindels is correlated with isochore chromosome maps would also be an interesting investigation.

References

Astolfi, A., Bellizzi, D., Sgaramella, V., 2003. Frequency and coverage of trinucleotide repeats in eukaryotes. Gene 317, 117–125.
Chuzhanova, N.A., Anassis, E.J., Ball, E.V., Krawczak, M., Cooper, D.N., 2003. Meta-analysis of indels causing human genetic disease: mechanisms of mutagenesis and the role of local DNA sequence complexity. Hum. Mutat. 21, 28–44.
Deininger, P.L., Batzer, M.A., 1999. Alu repeats and human disease. Mol. Genet. Metab. 67, 183–193.
Economou, E.P., Kazazian Jr., H.H., Antonarakis, S.E., 1992. Detection of mutations in the factor VIII gene using single-stranded conformational polymorphism (SSCP). Genomics 13, 909–911.
Gentles, A.J., Karlin, S., 2001. Genome-scale compositional comparisons in eukaryotes. Genome Res. 11, 540–546.
Gu, Z., Wang, H., Nekrutenko, A., Li, W.H., 2000. Densities, length proportions, and other distributional features of repetitive sequences in the human genome estimated from 430 megabases of genomic sequence. Gene 259, 81–88.
International Human Genome Sequencing Consortium, 2001. Initial sequencing and analysis of the human genome. Nature 409, 860–921.
Karlin, S., Ladunga, I., 1994. Comparisons of eukaryotic genomic sequences. Proc. Natl. Acad. Sci. U. S. A. 91, 12832–12836.
Katti, M.V., Ranjekar, P.K., Gupta, V.S., 2001. Differential distribution of simple sequence repeats in eukaryotic genome sequences. Mol. Biol. Evol. 18, 1161–1167.
Lander, et al., 2001. Initial sequencing and analysis of the human genome. Nature 409, 860–921 (Erratum in: Nature. 2001. 412, 565. Nature 2001. 411, 720 Szustakowki J [corrected to Szustakowski J]).
Lewis, S.M., Chen, S., Strathern, J.N., Rattray, A.J., 2005. New approaches to the analysis of palindromic sequences from the human genome: evolution and polymorphism of an intronic site at the NF1 locus. Nucleic Acids Res. 33, e186.
Lutz, S.M., Vincent, B.J., Kazazian Jr., H.H., Batzer, M.A., Moran, J.V., 2003. Allelic heterogeneity in LINE-1 retrotransposition activity. Am. J. Hum. Genet. 73, 1431–1437.
Marth, G., et al., 2001. Single-nucleotide polymorphisms in the public domain: how useful are they? Nat. Genet. 27, 371–372.
Marth, G., et al., 2003. Sequence variations in the public human genome data reflect a bottlenecked population history. Proc. Natl. Acad. Sci. U. S. A. 100, 376–381.
Miki, Y., Katagiri, T., Kasumi, F., Yoshimoto, T., Nakamura, Y., 1996. Mutation analysis in the BRCA2 gene in primary breast cancers. Nat. Genet. 13, 245–247.
Mouse Genome Sequencing Consortium, 2002. Initial sequencing and comparative analysis of the mouse genome. Nature 420, 520–526.
Oldridge, M., et al., 1999. De novo alu-element insertions in FGFR2 identify a distinct pathological basis for Apert syndrome. Am. J. Hum. Genet. 64, 446–461.
Ophir, R., Graur, D., 1997. Patterns and rates of indel evolution in processed pseudogenes from humans and murids. Gene 205, 191–202.
Ostertag, E.M., Kazazian Jr., H.H., 2001. Biology of mammalian L1 retrotransposons. Annu. Rev. Genet. 35, 501–538.
Rattray, A.J., 2005. A method for cloning and sequencing long palindromic DNA junctions. Nucleic Acids Res. 32, e155.
Rubin, C.M., Houck, C.M., Deininger, P.L., Freidmann, T., Schmid, C.W., 1980. Partial nucleotide sequence of the 300-nucleotide interspersed repeated human DNA sequences. Nature 284, 372–374.
Russell, G.J., Walker, P.M., Elton, R.A., Subak-Sharpe, J.H., 1976. Doublet frequency analysis of fractionated vertebrate nuclear DNA. J. Mol. Biol. 108, 1–23.
Shaw, C.J., Lupski, J.R., 2004. Implications of human genome architecture for rearrangement-based disorders: the genomic basis of disease. Hum. Mol. Genet. 13, R57–R64.
Smit, 1996. The origin of interspersed repeats in the human genome. Curr. Opin. Genet. Dev. 6, 743–748.
Stankiewicz, P., Lupski, J., 2002. Molecular-evolutionary mechanisms for genomic disorders. Curr. Opin. Genet. Dev. 12, 312–319.
Stenger, J.E., Lobachev, K.S., Gordenin, D., Darden, T.A., Jurka, J., Resnick, M.A., 2001. Biased distribution of inverted and direct Alus in the human genome: implications for insertion, exclusion, and genome stability. Genome Res. 11, 12–27.
Tan, E.C., Lai, P.S., 2005. Molecular diagnosis of neurogenetic disorders involving trinucleotide repeat expansions. Expert Rev. Mol. Diagn. 5, 101–109.
Toth, G., Gaspari, Z., Jurka, J., 2000. Microsatellites in different eukaryotic genomes: survey and analysis. Genome Res. 10, 967–981.
Wallace, M.R., Andersen, L.B., Saulino, A.M., Gregory, P.E., Glover, T.W., Collins, F.S., 1991. A de novo Alu insertion results in neurofibromatosis type 1. Nature 353, 864–866.
Weber, J.L., David, D., Heil, J., Fan, Y., Zhao, C., Marth, G., 2002. Human diallelic insertion/deletion polymorphisms. Am. J. Hum. Genet. 71, 854–862.
Zhang, Z., Gerstein, M., 2003. Patterns of nucleotide substitution, insertion and deletion in the human genome inferred from pseudogenes. Nucleic Acids Res. 31, 5338–5348.
Zhao, Z., Boerwinkle, E., 2002. Neighboring-nucleotide effects on single nucleotide polymorphisms: a study of 2.6 million polymorphisms across the human genome. Genome Res. 12, 1679–1686.
Zhao, Z., Fu, Y.X., Hewett-Emett, D., Boerwinkle, E., 2003. Investigating single nucleotide polymorphism (SNP) density in the human genome and its implications for molecular evolution. Gene 312, 207–213.
Zielenski, J., Tsui, L.C., 1995. Cystic fibrosis: genotypic and phenotypic variations. Annu. Rev. Genet. 29, 777–807.