Clonability and gene distribution on human chromosome 21: reflections of junk DNA content?


GENE: AN INTERNATIONAL JOURNAL ON GENES AND GENOMES

ELSEVIER

Gene 205 (1997) 39-46

Katheleen Gardiner *

Eleanor Roosevelt Institute, 1899 Gaylord Street, Denver, CO 80206, USA

Accepted 1 September 1997

Abstract

Data from transcriptional mapping of human chromosome 21 have been compiled from a number of sources. Regardless of the gene identification technique used, a consistent picture has developed: the centromere proximal half of 21q, which contains 50% of the DNA (20 Mb), harbors only 10% of the expressed sequences. Because of the variety of gene isolation techniques used, this result is unlikely to arise simply from methodological artefacts, biases in clonability or tissue specificity of expression. This region is known to be AT-rich and to contain APP, the largest gene (spanning 300 kb) currently analyzed on 21q. Interesting preliminary data from analysis of the Fugu rubripes homolog of APP have shown an unusually high, 50-fold, compaction of intron size, raising the intriguing possibility that >90% of the DNA in the human gene may be functionless. Thus, data from a variety of approaches suggest that a large part of 21q very likely has neither coding capacity nor associated regulatory function. By these criteria, it is a good candidate for a repository of junk DNA.

© 1997 Elsevier Science B.V.

Keywords: Chromosome bands; cDNA selection; Transcriptional mapping; Exon trapping; dbEST

1. Introduction

DNA is generally considered 'junk' when it has no discernible coding capacity, regulatory function or structural role. By this definition, Giemsa (G) bands are considered to be significantly more junk ridden than Reverse (R) bands. Clear evidence for this conclusion has come from several sources. For example, cytogenetic mapping of genes and cDNAs generally places them within R bands. Isochore analysis goes further (Bernardi, 1995). It has made clear that G bands are uniformly AT-rich, and that while R bands are variable in GC content, the highest GC levels are found in a special subset of R bands. The correlation of gene density with GC level makes explicit the relative gene-poorness of G bands.

* Tel.: +1 303 333 4515; Fax: +1 303 333 8423; e-mail: [email protected]

Abbreviations: APP, β-amyloid precursor protein; CBS, cystathionine β-synthetase; dbEST, database for Expressed Sequence Tags; G bands, Giemsa bands; GCG, Genetics Computer Group; kb, kilobase(s) or 1000 bp; Mb, megabase(s) or 10^6 bp; R bands, Reverse bands; RH, radiation hybrid; TIGR, The Institute for Genomic Research; YAC, yeast artificial chromosome.

0378-1119/97/$17.00 © 1997 Elsevier Science B.V. All rights reserved. PII S0378-1119(97)00481-2

Nevertheless, the contrast between G and R bands in junk-vs-gene content is sufficiently striking to deserve scrutiny from as many perspectives as possible. The definitive perspective, of course, would be the complete DNA sequence of the human genome. However, while having the complete DNA sequence would go a long way towards final assessment of function, it is naive to assume that we can at this point efficiently and correctly interpret all large-scale human genomic DNA sequence data. It is also unnecessary to wait until the human genome is completely sequenced to gain further information on genome organization. There is a large body of information and resources generated in the course of the human genome initiative that already provides opportunities for a more rigorous look at the molecular nature of G and R bands.

Data from human chromosome 21 provide a particularly good model system. The smallest of the human chromosomes, the long arm of chromosome 21, 21q, contains only 40 Mb of DNA (the short arm, 21p, appears to contain largely, if not solely, ribosomal RNA and repetitive sequences; it is not discussed further here). 21q now has detailed physical and clone maps, and a reasonable transcriptional map; it has also been well studied from a disease and biological perspective. Here we will specifically review data from the perspective of physical and transcriptional mapping that suggest that a large proportion of 21q very likely has neither coding capacity nor associated regulatory function.

Fig. 1 includes a schematic of chromosome 21 at the 800 band stage. Data from several perspectives show that the centromere proximal half of 21q, comprised largely of the G band 21q21, and the telomere proximal half, largely the R band 21q22, and in particular 21q22.3, are essentially opposites. Here, we will briefly describe the phenotypic and physical/clone maps. We will then spend considerable time reviewing data from transcriptional mapping efforts because coding capacity is currently our best indicator of biological import. Lastly, we will provide some new and preliminary information on 'junk vs function' in G bands deriving from analysis of genomic sequence data and analysis of homologous Fugu rubripes sequences.

2. Methods

2.1. Cosmid clones from 21q11.2-21q21

Unique sequence clones were obtained from a microdissection library constructed specifically from this region of 21q (J-W. Yu, personal communication). Clones were verified to map appropriately by hybridization to a panel of somatic cell hybrids containing various regions of 21q. Clones were then used to screen the chromosome 21 specific cosmid library, LL21NCO2-Q (Soeda et al., 1995). Positive clones were analyzed by hybridization and restriction patterns to identify identical and overlapping cosmids (K.G., unpublished).

2.2. Transcriptional mapping: placenta and infant brain cDNAs (K.G., manuscript in preparation)

The method of reciprocal cosmid and cDNA library screening (Lee et al., 1995) was used with the LL21NCO2-Q cosmid library and a human placenta cDNA library. Resultant cDNAs were sequenced and verified to map to the indicated regions of 21q. Infant brain cDNAs were obtained from Genethon and regionally localized by hybrid panels described in Gardiner et al. (1995a).

2.3. Genomic sequence

DNA sequences in Table 4 were provided by Drs Sakaki and Hattori (University of Tokyo) and by the Stanford Human Genome Center. All sequences are available on the chromosome 21 web site: http://wwweri.uchsc.edu. Accession numbers in GenBank are given where available. Sequence analysis was done with XGRAIL, using GRAIL 2.

[Fig. 1 graphic not reproduced. Panels: (a) Genomic analysis, giving Mb of DNA and number of NotI sites per band: q11.2-q21, 20.8 Mb and 6 sites; q22.1, 8 Mb and 10 sites; q22.2, 3.6 Mb and 6 sites; q22.3, 7.6 Mb and 18 sites. (b) Cosmid analysis, library coverage: 1-2× in 21q21, 12-15× in 21q22.3. (c) YAC analysis, giving genomic distance, YAC coverage and missing DNA: distal 21q21 (3 NotI fragments), 11 Mb, 6 Mb and 5 Mb; distal 21q22.3 (11 NotI fragments), 3.8 Mb, <1 Mb and 3 Mb.]

Fig. 1. Physical/clone maps. A schematic of chromosome 21 G and R bands at the 800 band stage is shown (from Francke, 1994). (A) The amount of DNA in each band assumes 40 Mb in 21q and a uniform DNA concentration. The number of NotI sites is from Ichikawa et al. (1993). (B) Coverage of the LL21NCO2-Q cosmid library determined by screening with region specific probes; in 21q22.3, from Yaspo et al. (1995); in 21q21, Gardiner, unpublished. (C) Data from Gardiner et al. (1995b) in the analysis and verification of the YAC contig of Chumakov et al. (1992). While all 200 markers from the original YAC contig map are present, analysis of the amount of DNA contained in the YACs spanning different regions shows that significant amounts of DNA are unaccounted for.


2.4. Fugu rubripes

Filters containing the arrayed Fugu cosmid library, constructed by G. Elgar (Elgar et al., 1995), and Fugu cosmid clones were obtained from the German Human Genome Resources Center. Library filters were screened with the human APP cDNA (Tanzi et al., 1987) and cosmid clone N2225 was selected for sequence analysis. A 3-kb TaqI fragment hybridizing to APP was subcloned and sequenced. The remainder of the Fugu APP gene was sequenced by primer walking in N2225. BLAST searches and GCG programs were used to identify APP exons. Sequencing of 5 kb from one end of N2225 similarly revealed the homolog of the human cystathionine beta synthetase (CBS) gene (K.G. and L. Villard, submitted).

3. Results and discussion

3.1. Phenotypic map

Molecular and phenotypic analysis of a large number of individuals that are partially trisomic or partially monosomic for varying regions of 21q shows two interesting features. First, while there appear to be phenotypic effects associated with monosomy or trisomy of any region of 21q, these effects are significantly less severe when the affected region is within 21q21 (Korenberg et al., 1994). Secondly, when the breakpoints associated with the mono- and trisomy rearrangements are mapped, a much smaller fraction is found within 21q21 than 21q22 (Graw et al., 1995; Korenberg et al., 1994). This may mean either a decreased propensity for rearrangements to occur within 21q21, or a decreased propensity for detection of rearrangements (i.e., due to lack of noticeable phenotypes) within 21q21. The important conclusion, however, is that 21q21 has an undeniable biological role, i.e., it is not totally junk, yet this role is less significant, as judged by the consequences of its disruption, than that of 21q22.

3.2. Physical and clone maps

(1) Fig. 1A includes data from the NotI restriction map and from analysis of YAC and cosmid libraries. 21q21 and 21q22 display opposite characteristics. NotI sites, identified by an 8-base-pair restriction site that is entirely G + C and that contains two CpG dinucleotides, are estimated to occur on average once per megabase in the human genome. Within 21q21 the average NotI fragment is 3.4 Mb, while in 21q22 it is <600 kb, and in the telomeric region, 21q22.3, it is 400 kb (Ichikawa et al., 1993). Because NotI sites are associated with CpG islands and thus with genes, this has potential implications for variations in gene density and/or in gene size in the associated regions.

(2) Fig. 1B includes information from the screening of a chromosome 21 cosmid library. This library (LL21NCO2-Q) was constructed at Lawrence Livermore Laboratory using chromosome 21 material obtained by flow sorting chromosomes from a hamster-human hybrid cell line (Soeda et al., 1995). The library comprises 10 000 clones, with average inserts of 35 kb; assuming 50 Mb within chromosome 21 (including the short arm), this implies approximately 7× coverage. However, when fragments mapping to 21q22.3 were used to screen the library, an average of 12-15 cosmids were identified for each fragment (Yaspo et al., 1995). Conversely, screening the library with fragments from 21q11.2-21q21 yielded an average of only three cosmids, with several being absent from the library completely (K.G., unpublished). This relative under-representation of 21q21 sequences has been seen in the products of many cloning efforts (Gardiner et al., 1990), and seems likely to be due to a combination of sequence features of 21q21, such as high repeat content, unusual patterns in repeat sequences, and a relative paucity of unique sequences with many buried in extensive repetitive domains.

(3) Several YAC contigs have been reported to have good coverage of 21q (Chumakov et al., 1992; Nizetic et al., 1994; Gardiner et al., 1995b). However, when distances seen in minimal tiling paths are compared with the known genomic distances, clear discrepancies are seen (Gardiner et al., 1995b). Two regions in particular stand out and are indicated in Fig. 1C. One is in the distal half of 21q21, where a genomic distance of 11 Mb is represented by only 6 Mb in YAC clones. The second region is in distal 21q22.3, where a section that spans almost 4 Mb in genomic DNA comprises <1 Mb in YAC DNA.

The sequence characteristics of these two regions are quite different (21q21 is AT-rich and 21q22.3 is GC-rich), as are the most likely reasons for the YAC discrepancies. In 21q21, we speculate that YAC coverage is poor because we lack markers for detecting YACs; this would be a result of the poor clonability of this region in cosmid or lambda clones from which the markers to screen YAC libraries are derived. In 21q22.3, markers abound (Cheng et al., 1994; Gardiner et al., 1995a); the problem here is probably the instability of very high GC (>45-50%) content sequences in yeast, whose genome averages only 38% GC. Such observations have been made by others, most notably for chromosome 3 and the X chromosome (Pekarsky et al., 1995; Pilia et al., 1993; de Sario et al., 1996).
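The coverage arithmetic in item (2) and the NotI spacing in item (1) can be checked with a few lines of Python. This is a minimal sketch, not part of the original analysis: the function names are hypothetical, the NotI site counts are those shown in Fig. 1A, and mean fragment size is approximated as band length divided by site count.

```python
def fold_coverage(n_clones, insert_kb, target_mb):
    """Expected library coverage: total cloned DNA divided by target size."""
    return (n_clones * insert_kb) / (target_mb * 1000.0)


def representation(observed_clones_per_probe, expected_fold):
    """Observed clones per probe relative to the nominal coverage."""
    return observed_clones_per_probe / expected_fold


def mean_fragment_mb(band_mb, n_sites):
    """Approximate mean NotI fragment size (assumption: length / site count)."""
    return band_mb / n_sites


if __name__ == "__main__":
    # 10 000 cosmids, 35 kb average insert, 50 Mb chromosome (values from the text).
    expected = fold_coverage(10_000, 35, 50)
    print(f"expected coverage: {expected:.1f}x")

    # Average cosmids recovered per probe, as quoted in the text.
    for label, observed in [("21q11.2-q21", 3), ("21q22.3 low", 12), ("21q22.3 high", 15)]:
        print(f"{label}: {representation(observed, expected):.2f} of expectation")

    # Band sizes and NotI site counts as shown in Fig. 1A.
    print(f"21q11.2-q21 mean NotI fragment: {mean_fragment_mb(20.8, 6):.1f} Mb")
    print(f"21q22.3 mean NotI fragment: {mean_fragment_mb(7.6, 18) * 1000:.0f} kb")
```

Under these assumptions, 21q21 probes recover fewer than half the clones expected from nominal coverage, whereas 21q22.3 probes recover roughly twice the expectation.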


3.3. Transcriptional mapping

The demonstration that a segment of genomic DNA is transcribed into part of a mature mRNA is a relatively unambiguous demonstration of its functional importance. Several methods now exist for identifying transcribed sequences from genomic DNA. Among the most popular are cDNA hybridization selection, exon trapping, mapping of random cDNAs and analysis of genomic sequence (Lovett, 1994; Church et al., 1993; Schuler et al., 1996; Fickett, 1996). Each of these is, however, associated with some limitations; none can be expected to find all genes within a region, and consistent artefacts and shortcomings can be expected to distort the picture of gene density obtained with any one method. Table 1 lists some cautions to be considered with each of these approaches.

Technical difficulties associated with cDNA selection and/or random cDNAs include the failure to generate cDNA that accurately reflects the original population of mRNA. This would be due to inefficiencies of reverse transcriptase caused by strong secondary structures in mRNAs. Other problems include the presence of unprocessed introns in polyA+ RNA, and the variable cloning efficiencies of different cDNAs. Results of exon trapping are affected by the variable efficiencies with which the COS cell machinery recognizes not only bona fide splice sites but also illegitimate splice sites. cDNA approaches will obviously miss any genes not expressed in the tissues examined, and exon trapping will miss intronless genes. It is reasonable to anticipate, however, that results from multiple approaches will be at least partially complementary, and taken together will yield a potentially accurate picture of gene distribution. Some of these are summarized here.

3.3.1. cDNA selection and exon trapping

Transcriptional mapping efforts on chromosome 21 have included application of cDNA selection to YAC clones of defined map location, application of cDNA selection and exon trapping to random cosmids with subsequent mapping of positive clones, isolation and mapping of placenta and infant brain cDNAs, and regional localization of genes of characterized biochemical function (Tassone et al., 1995; Xu et al., 1995; Cheng et al., 1994; Yaspo et al., 1995; Shimizu et al., 1995).

Table 1
Transcriptional mapping methods

Method                    Cautions
cDNA selection            Quality of cDNA; expression in tissues used
Exon trapping             Expression independent; gene with > 1 intron; efficiency of splice sites
Mapping of random cDNAs   Quality of cDNA libraries; level of expression; tissues of expression
Genomic sequencing        Reliability of gene identification software

Gene densities in four regions of 21q derived from these various approaches are summarized in Table 2. It is abundantly clear that success of cDNA selection is strongly affected by the YAC clones chosen, at least with the tissues that were used here (a mixture of fetal brain, whole fetus, testes, liver, thymus and spleen). A difference in gene density of >100-fold is seen between a YAC in 21q22.3 and several YACs in 21q21: in the former a gene density of 1 per 10 kb is seen, and in the latter, 1 per 6000 kb. When randomly chosen cosmids are used in cDNA selection and exon trapping (Cheng et al., 1994; Yaspo et al., 1995), a significant difference is again seen between the two regions: 21q21 contains only 15% of the products. A similar picture develops when random placenta and infant brain cDNAs are localized: fewer than 15% are found within 21q21 (K.G., unpublished).

The gene distribution seen with these analyses is supported by the mapping of characterized genes. Only 4 of 44 are within the proximal half of 21q (Shimizu et al., 1995). These data must be considered fundamentally different in reliability from those obtained by cDNA selection or exon trapping. They are not derived from efforts directed specifically at chromosome 21; rather, these genes were isolated simply by virtue of their biological function (oncogenes, those mutated in disease, homologues of Drosophila, etc.) or biochemical role (steps in enzymatic pathways) and subsequently found to map to chromosome 21. Thus, they are not dependent on genomic clone representation, nor are they directed in a tissue-specific fashion.

In all of these cases, then, the approx. 20 Mb of 21q21 apparently contains only 10-15% of genes in spite of containing >50% of the DNA, and thus it is predicted to be significantly less gene rich than the 20 Mb of 21q22. How well does this reflect the actual gene density? This is unknown, but two important possibilities to consider include: (1) If additional tissues, possibly 'rare' tissues or different developmental times, were examined, would less of a disparity in cDNA selection results with YAC clones be seen? and (2) As noted above, the chromosome 21 cosmid library is biased towards 21q22.3 and against 21q21; would choosing cosmids specifically from 21q21 lessen the biased gene distribution results seen in Table 2?

3.3.2. dbEST and RH mapping

An approach that should help to eliminate the problems of tissue specificity and genomic clone representation is the random cDNA sequencing project of TIGR and Merck. These have generated 3' and 5' sequence information from thousands of cDNAs from numerous different tissue-specific cDNA libraries, and now are adding regional mapping of the same cDNAs using panels of whole genome radiation hybrids. Recently, the mapping of 16 000 of these cDNAs was reported (Schuler et al., 1996). Table 3 presents data obtained from analysis of the chromosome 21 information. Duplicated entries were eliminated, as were those for genes already known to map to chromosome 21. This left 113 new expressed sequences, whose approximate regional localization (inferred from genetic map data) is shown in columns 2 and 3. Again, the dearth of cDNAs within 21q21, <10%, is clear (although here the bias towards 21q22.3, which contains 37% of these cDNAs, is less pronounced than that seen in Table 2). It is also of interest to note the numbers of cDNAs presumably expressed in brain (fetal, infant or adult) based on their being found in appropriate cDNA libraries. At 50%, 21q21 does not stand out in this regard from the other, more gene-rich regions.

Table 2
Transcriptional mapping summary

                         Number                                             Gene densities (1 per kb)
Region      Mb (%)       cDNAs and exons  Miscellaneous cDNAs  Known genes  cDNAs    Genes    From YACs
q11.2-q21   20.8 (52%)   16               3                    4            1/1100   1/5400   1/6000
q22.1       8.0 (20%)    28               5                    13           1/260    1/620    1/50-1/200
q22.2       3.6 (9%)     17               3                    7            1/180    1/500    -
q22.3       7.6 (19%)    48               11                   20           1/130    1/380    1/10

Amount of DNA in each band is as in Fig. 1. Numbers of genes/cDNAs obtained from several methods are given. cDNAs and exons: data compiled from Cheng et al. (1994), Yaspo et al. (1995) and Gardiner et al. (1995a). Miscellaneous cDNAs: placenta and infant brain cDNAs (Gardiner et al., unpublished; Gardiner, 1996). Known genes: from mapping of genes isolated for biological functions (summarized in Shimizu et al., 1995). Gene densities were calculated from the number of cDNAs and genes in each band (columns 3-5) and the amount of DNA. Gene densities from YAC clones are based on definition of a non-redundant set of cDNA fragments obtained by cDNA selection from YAC clones targeting specific regions of 21q (Tassone et al., 1995; Xu et al., 1995; Gardiner, 1996). No YACs from G band 21q22.2 were used in this attempt.
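The gene densities in Table 2 follow from dividing each band's DNA content by the number of mapped items. The sketch below recomputes them from the counts in the table. It assumes that the 'cDNAs' density pools the two cDNA columns (an inference, not stated in the caption); the dictionary and function names are hypothetical. The results agree with the published values to within rounding for most bands; small differences, for example in q22.1, presumably reflect details not recoverable from the table.

```python
# Band sizes (Mb) and counts transcribed from Table 2.
TABLE2 = {
    "q11.2-q21": {"mb": 20.8, "cdnas_exons": 16, "misc_cdnas": 3, "known_genes": 4},
    "q22.1": {"mb": 8.0, "cdnas_exons": 28, "misc_cdnas": 5, "known_genes": 13},
    "q22.2": {"mb": 3.6, "cdnas_exons": 17, "misc_cdnas": 3, "known_genes": 7},
    "q22.3": {"mb": 7.6, "cdnas_exons": 48, "misc_cdnas": 11, "known_genes": 20},
}


def kb_per_item(mb, count):
    """Average spacing in kb between mapped items (1 item per N kb)."""
    if count <= 0:
        raise ValueError("count must be positive")
    return mb * 1000.0 / count


def densities(table):
    """Return kb per cDNA and kb per known gene for each band."""
    result = {}
    for region, row in table.items():
        # Assumption: the cDNA density pools both cDNA columns.
        cdnas = row["cdnas_exons"] + row["misc_cdnas"]
        result[region] = {
            "kb_per_cdna": kb_per_item(row["mb"], cdnas),
            "kb_per_gene": kb_per_item(row["mb"], row["known_genes"]),
        }
    return result


if __name__ == "__main__":
    for region, d in densities(TABLE2).items():
        print(f"{region:10s} 1 cDNA per {d['kb_per_cdna']:5.0f} kb, "
              f"1 known gene per {d['kb_per_gene']:5.0f} kb")
```

Expressing density as kilobases per item keeps the numbers directly comparable with the '1 per N kb' notation used in the table.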
Information on expression levels and tissue specificity of cDNAs from 21q21 can also be inferred from the information on copy number in dbEST and from the cDNA library of origin. As shown in the last two columns of Table 3, regardless of the region of chromosome 21 examined, approximately 40-50% of the cDNAs can be classed as 'rare' messages, those found only once in dbEST. In addition to brain, the tissues from which these 'rare' cDNAs have been obtained are lung, retina, placenta and (surprisingly) liver. In contrast, other tissues examined, including parathyroid, heart, melanocytes, uterus and cochlea, have so far yielded largely only more commonly expressed genes. When looking at transcripts that are highly expressed, based on their being found >5 times in dbEST, a uniform proportion of cDNAs (10-20%) is also found, again regardless of region. Thus, neither in proportion of rare messages nor in proportion of highly expressed messages does 21q21 stand out. Nor is it showing a bias in content towards messages from tissues uncommonly analyzed. These data are suggestive only; they do not present the final picture, because of the limitations of any cDNA library and because the sequencing and mapping of all cDNAs is not complete. However, they are important because they lessen the possibility that 'gene-poor' regions simply harbor large numbers of tissue-specific genes and/or genes of low expression levels.

3.3.3. Genomic sequence analysis

Sequencing of 21q has begun and is expected to be completed by 1999. At the time of writing, data for eight segments of greater than 70 kb are available and permit some comparisons of regional characteristics. Table 4 lists the accession numbers, chromosomal band of origin and the amount of sequence analyzed in each sample. The sequence analysis program, GRAIL (Mural, 1995), has been used to determine % GC, CpG island density, Alu repeat content and density of L1

Table 3
Chromosome 21 ESTs

Region                 New homologies  Anonymous  Brain (%)  'Rare' (n = 1)  'High' (n > 5)
q11.2-q21 (cen-APP)    1               9 (8%)     5 (50%)    5 (50%)         1 (10%)
q22.1 (APP-AML1)       3               47 (44%)   25 (50%)   25 (50%)        10 (20%)
q22.2 (AML1-ETS2)      1               10 (9%)    6 (55%)    4 (35%)         1 (10%)
q22.3 (ETS2-tel)       1               41 (38%)   15 (36%)   16 (38%)        9 (21%)

Data are adapted from Schuler et al. (1996). Regional localization data were inferred from genetic map data. Redundant ESTs and those for genes already mapped to chromosome 21 (Table 2) were eliminated, leaving 113 new cDNAs. New homologies: those with functional associations. Brain: number and % of cDNAs derived from fetal, infant or adult brain cDNA libraries. Rare: number and % of cDNAs found only once within the entirety of dbEST. High: number and % of cDNAs assumed to be expressed at high levels based on being found >5 times within the entirety of dbEST. All % are based on the total number of chromosome 21 new ESTs in each region, i.e., the sum of new homologies plus anonymous ESTs.
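The regional shares quoted for the new ESTs (<10% in 21q21, 37% in 21q22.3) can be recomputed from the counts in Table 3 and set against each band's share of the 40 Mb of 21q from Table 2. The following is an illustrative sketch only; the names `REGIONS`, `est_totals`, `shares` and `enrichment` are hypothetical, and the 'enrichment' ratio is simply EST share divided by DNA share, a convenience measure not used in the original text.

```python
# EST counts transcribed from Table 3; band sizes (Mb) from Table 2.
REGIONS = {
    "q11.2-q21": {"new_homologies": 1, "anonymous": 9, "mb": 20.8},
    "q22.1": {"new_homologies": 3, "anonymous": 47, "mb": 8.0},
    "q22.2": {"new_homologies": 1, "anonymous": 10, "mb": 3.6},
    "q22.3": {"new_homologies": 1, "anonymous": 41, "mb": 7.6},
}


def est_totals(regions):
    """New ESTs per region: new homologies plus anonymous ESTs."""
    return {name: r["new_homologies"] + r["anonymous"] for name, r in regions.items()}


def shares(values):
    """Convert a mapping of counts or sizes into fractions of the total."""
    total = sum(values.values())
    return {name: value / total for name, value in values.items()}


def enrichment(regions):
    """EST share divided by DNA share; values below 1 indicate EST-poor bands."""
    est_share = shares(est_totals(regions))
    dna_share = shares({name: r["mb"] for name, r in regions.items()})
    return {name: est_share[name] / dna_share[name] for name in regions}


if __name__ == "__main__":
    est_share = shares(est_totals(REGIONS))
    ratio = enrichment(REGIONS)
    for name in REGIONS:
        print(f"{name:10s} {est_share[name]:6.1%} of ESTs, enrichment {ratio[name]:.2f}")
```

On these figures the proximal band holds about half of the DNA but well under a fifth of the ESTs expected from its size.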


Table 4
Sequence analysis

Sequence     Region  kb   % GC  CpG island density  Alu density  Predicted exon density  Gene models
TU1          q21     296  40.1  1/60                1/1.1 kb     1/10                    -- (APP)
TU2          q22.2   416  40.3  1/46                1/1.6 kb     1/7                     12 (1/35)
hsac000002   q22.2   94   45.8  0                   1/0.95 kb    1/10                    2 (1/47)
hsac000009   q22.3   75   51.5  1/37                1/6.3 kb     1/8                     2 (1/38)
hsac000013   q22.3   70   57.4  1/12                1/1.3 kb     1/3                     4 (1/18)
hsac000014   q22.3   76   43.3  1/76                1/1.5 kb     1/5                     3 (1/25)
hsac000017   q22.3   78   56.3  1/9                 1/1.0 kb     1/8                     2 (1/39)
hsac000020   q22.3   67   60    1/13                1/2.0 kb     1/2.5                   3 (1/22)

Genomic sequences of >67 kb, their total sequence length and their location within 21q were obtained from the chromosome 21 web site and generated by the University of Tokyo sequencing effort (TU1 and TU2, Drs Hattori and Sakaki; Hattori et al., 1997) and the Stanford Genome Center (hsac000002-20; Drs Cox, Myers and Volrath). XGRAIL (Mural, 1995) software was used to determine the base composition and Alu content, and to predict the number, location and density of CpG islands. The GRAIL2 gene finding function was used to predict exons rated as excellent (>90% probability of being a bona fide exon) and to construct gene models. The density of predicted exons and potential genes is given.

sequences, and to predict exons and gene models. As anticipated, the 300 kb from 21q21 has a low GC%, and a correspondingly low CpG island density. The latter is not, however, dramatically different from the other low GC% segments which map to the other G band, 21q22.2, or to the R band, 21q22.3. Interestingly, the Alu content is not highly variable with GC% or band location, contrary to prior data based largely on cytogenetic levels of resolution (Korenberg and Rykowski, 1988), but consistent with data obtained from analysis of YAC clones in distal Xq (Porta et al., 1993). Indeed, the segment of 21q21 has essentially the same Alu density as segments of 56% and 57% GC mapping to 21q22.3 (the representation of specific Alu family members has not been investigated; this may well vary in a band-specific or base compositionally related fashion).

When GRAIL is asked to predict exons, results are again not dramatically different for the 21q21 segment than for several segments of differing characteristics. The 40% GC 300 kb from 21q21 has one excellent exon per 10 kb, as does a segment of 45% GC; two segments with >50% GC are predicted to have one exon per 8 kb. Unlike other segments, however, in 21q21 there are few clusters of exons, and GRAIL does not predict any convincing gene models. Cautions are in order, however: the 300 kb segment of 21q21 is known to span the APP gene. While GRAIL predicts 54 exons in this sequence, it fails to predict 5 of the 19 exons known to comprise the APP gene. Thus, both the false positive and the false negative rates of exon prediction are high. While GRAIL does significantly better in the more GC-rich regions (it predicts most exons of all the known genes and many new gene models that can be tested), even in the highest GC% segments, GRAIL does not predict a gene density of better than one per 20 kb, significantly less than expected from isochore analysis (Bernardi, 1995).

The determinations of base composition, CpG island prediction and Alu content for the segments shown in Table 4 can be considered robust. Beyond these, what this sequence analysis suggests is that we cannot yet be confident of our skills at gene finding by software analysis.
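The statement that both error rates are high for the APP segment can be made concrete with the three numbers given in the text (54 predicted exons, 19 known APP exons, 5 missed). The sketch below is illustrative and rests on two labeled assumptions that the source does not confirm: each recovered APP exon corresponds to exactly one GRAIL prediction, and APP is the only gene in the segment, so that every other prediction is unsupported. The function name is hypothetical.

```python
def exon_prediction_summary(known_exons, missed_exons, predicted_exons):
    """Summarize exon prediction accuracy from simple counts.

    Assumptions (not from the source): one prediction per recovered known
    exon, and no genes other than the known one in the segment.
    """
    found = known_exons - missed_exons
    unsupported = predicted_exons - found
    return {
        "found": found,
        "sensitivity": found / known_exons,
        "false_negative_rate": missed_exons / known_exons,
        "unsupported": unsupported,
        "unsupported_fraction": unsupported / predicted_exons,
    }


if __name__ == "__main__":
    # Counts for the 21q21 segment spanning APP, as given in the text.
    summary = exon_prediction_summary(known_exons=19, missed_exons=5, predicted_exons=54)
    for key, value in summary.items():
        print(f"{key}: {value:.2f}" if isinstance(value, float) else f"{key}: {value}")
```

If some of the unsupported predictions belong to undiscovered genes, the true false positive fraction would be lower than this upper bound.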

3.4. Junk vs. function

The Japanese pufferfish, Fugu rubripes, has been proposed as a useful addition to the model organism repertoire, in particular for the study of vertebrate genome organization (Elgar et al., 1996; Koop and Nadeau, 1996). Fugu, as a vertebrate, is expected to contain essentially the same set of genes as mammals, yet its genome size, due to smaller introns, shorter intergenic distances and a general lack of repetitive DNA, is only 400 Mb, as opposed to the 3000 Mb of mammals. Preliminary data from the analysis of Fugu homologs of chromosome 21 genes support the thought that 21q21, or at least genes within 21q21, may contain significant amounts of junk DNA.

The Fugu homolog of the human APP gene has been isolated and sequenced (Gardiner and Villard, manuscript submitted). As expected, the intron/exon structure is highly conserved. Intron sizes, however, show a striking compaction. The 18 introns of the human gene average approximately 15 kb in length; the 16 in Fugu average only 240 bp. This gives a compaction factor of approximately 50, much greater than the average factor of 8 in gene compaction predicted from the difference in genome sizes.

How unique is the APP gene in this respect? A second gene from chromosome 21, CBS, spans 24 kb in human, and only 5 kb in Fugu (Gardiner and Villard, manuscript in preparation); the L1CAM and the G6PD genes show compactions of 0 and 4 in Fugu (L1CAM is 12 and 13 kb; G6PD, 14 and <4 kb, in human and Fugu, respectively) (Riboldi et al., 1996; Mason et al., 1995). All three of these genes in humans are 50% GC, while APP, as shown in Table 4, is 40%. The Huntington's disease gene is 45% GC in humans, and is compacted 7.5-fold in Fugu (Baxendale et al., 1995).

These differences in base composition are relevant. Isochore analysis has shown that gene density varies dramatically with GC level (Bernardi, 1995). The most GC-rich regions (>50% GC) average one gene per 5-10 kb, while the most AT-rich regions (38-40% GC) may harbor only one gene in 100-200 kb. Isochore data are supported by observations on intron size, indicating that introns of AT-rich genes are 3-fold larger than those of GC-rich genes (Duret et al., 1995). Clearly, the smaller the human gene, the less the compaction possible in the Fugu homolog. Thus, it is expected that AT-rich human genes will show the greatest compactions. The 50-fold compaction of the introns seen in the APP gene would not, however, have been predicted, and how common this is remains to be seen. If additional large AT-rich genes show similar compactions of 50-fold, this will be good evidence that much of the DNA in large introns is true 'junk', dispensable, at least to be a pufferfish.
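The compaction factors discussed in this section are simple size ratios, and the fraction of sequence that is 'dispensable' at a given factor follows directly. A minimal sketch is given below, using only sizes quoted in the text; the function names are hypothetical. Note that the rounded intron averages quoted (15 kb and 240 bp) give a ratio of roughly 60, the same order as the approximately 50-fold factor reported, and that the G6PD value is a lower bound because the Fugu gene is given only as <4 kb.

```python
def compaction(human_size, fugu_size):
    """Fold compaction: human size divided by Fugu size (same units)."""
    return human_size / fugu_size


def dispensable_fraction(factor):
    """Fraction of the human sequence absent from Fugu at a given compaction."""
    return 1.0 - 1.0 / factor


if __name__ == "__main__":
    # Genome sizes in Mb; gene spans in kb; mean intron sizes in bp (from the text).
    print(f"genome:      {compaction(3000, 400):.1f}-fold")
    print(f"APP introns: {compaction(15_000, 240):.1f}-fold (rounded averages)")
    print(f"CBS:         {compaction(24, 5):.1f}-fold")
    print(f"L1CAM:       {compaction(12, 13):.2f}-fold (no compaction)")
    print(f"G6PD:        >{compaction(14, 4):.1f}-fold (Fugu gene is <4 kb)")
    print(f"dispensable at 50-fold: {dispensable_fraction(50):.0%}")
```

At a 50-fold compaction about 98% of the human sequence has no Fugu counterpart, consistent with the >90% figure given for APP.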

3.5. Summary

Phenotypic and molecular analysis of individuals carrying deletions or duplications of 21q11.2-q21 suggests that this region harbors sequences of overall less biological importance than those from more distal regions. Known to be AT-rich (and thus gene-poor from isochore analysis), containing few NotI sites (only 6 of 40, although it comprises >20 of the 40 Mb of 21q), and harboring only four of the 44 biochemically characterized genes mapped to 21q, this region is a good candidate for a repository of junk DNA.

Assessing the biological importance of a genomic region based on its coding capacity is a popular and reasonable concept. Determining the coding capacity of genomic sequence is, however, not trivial; methods to do so are fraught with shortcomings and artefacts. Even so, the use of a variety of approaches, each characterized by different strengths and weaknesses, has consistently produced the same result for 21q: on average, only 10% of its transcribed sequences are derived from 21q11.2-q21.

The argument can of course be made that until we have the entire sequence of the human genome and have perfected our ability to identify all genes in genomic DNA, no definitive statement can be made. Nevertheless, strengths of the data presented here include: (i) cDNA selection was carried out with YACs targeted to specific regions of 21q precisely so that comparisons of gene density could be made. In addition, a broad complexity of mRNA was present in the source material based on the tissues chosen. Thus, it is reasonable to conclude that 21q11.2-q21 does not contain many genes expressed in fetal brain, testes or the whole 18-week fetus. (ii) cDNAs used in the RH mapping of dbEST entries included those from analysis of libraries from more exotic tissues, such as retina, cochlea and melanocytes. Expressed sequences from 21q11.2-q21 were still underrepresented in these tissues. (iii) While it can be argued that the cDNA selection and exon trapping data from cosmids are biased because of the biased clone representation in the starting cosmid library, this criticism can be addressed by repeating exon trapping using cosmids specifically derived from 21q21. Indeed, this has been done using cosmids obtained from 21q21 microclones. Preliminary results indicate that relatively few exons were obtained, approximately 25% of the number obtained when using the same number of cosmids from 21q22 (K.G. and M.-L. Yaspo, unpublished observations). (iv) Finally, analysis of the Fugu homolog of the large (300 kb), AT-rich (40% GC) human APP gene indicates that more than 90% of this genomic sequence can be eliminated and still produce a functioning vertebrate. If 21q21 is really home to a small number of large AT-rich genes, as accumulating data suggest for G bands in general, this strengthens the hypothesis that G bands are also home to large quantities of dispensable DNA.

Data from the Human Genome Initiative and, in particular, from transcriptional mapping efforts have greatly increased the detail in our picture of human genome organization. Data from human chromosome 21 have strengthened the hypothesis that some G bands have relatively little protein coding capacity, and, as such, are good candidates for harboring quantities of junk DNA.
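
The under-representation described in the Summary can be expressed as a ratio of each feature's observed share in 21q11.2-q21 to the region's share of the DNA. The following Python sketch is an illustration only: the function and variable names are hypothetical, the counts are those quoted in the Summary, and the DNA share of 0.5 is an assumed lower bound (the region comprises more than 20 of the 40 Mb), so the ratios are upper bounds.

```python
# Illustrative sketch; names are hypothetical. Counts are those quoted in the Summary.

def representation(count_in_region: int, count_total: int, dna_share: float) -> float:
    """Observed share of a feature divided by the region's share of the DNA.

    A value of 1.0 means the feature is as frequent as expected under a
    uniform distribution; values below 1.0 mean under-representation.
    """
    return (count_in_region / count_total) / dna_share


# 21q11.2-q21 comprises >20 of the 40 Mb of 21q, so 0.5 is a lower bound on its
# DNA share and the ratios below are therefore upper bounds.
DNA_SHARE = 20 / 40

features = {
    "NotI sites": (6, 40),
    "characterized genes": (4, 44),
    "transcribed sequences (approx. 10%)": (10, 100),
}

ratios = {
    name: representation(n, total, DNA_SHARE)
    for name, (n, total) in features.items()
}

if __name__ == "__main__":
    for name, ratio in ratios.items():
        print(f"{name}: {ratio:.2f} of the uniform expectation")
```

All three ratios fall well below 1.0, which restates numerically the point that independent measures agree on the gene-poorness of the region.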

Acknowledgement

This work was supported by grant HD17749 from the National Institutes of Health. It is contribution #1637 from the Eleanor Roosevelt Institute. The author wishes to thank M.-L. Yaspo and the Resource Center of the German Human Genome Project for the Fugu cosmid clone.

References

Baxendale, S., Abdulla, S., Elgar, G., Buck, D., Berks, M., Micklem, G., Durbin, R., Bates, G., Brenner, S., Beck, S., Lehrach, H., 1995. Comparative sequence analysis of the human and pufferfish Huntington's disease genes. Nature Genet. 10, 67-77.
Bernardi, G., 1995. The human genome: organization and evolutionary history. Annu. Rev. Genetics 29, 445-476.
Cheng, J.-F., Boyartchuk, V., Zhu, Y., 1994. Isolation and mapping of human chromosome 21 cDNA: progress in constructing a chromosome 21 expression map. Genomics 23, 75-84.
Chumakov, I., Rigault, P., Guillou, S., Ougen, P., Billaut, A., Guasconi, G. et al., 1992. A continuum of overlapping clones spanning the entire human chromosome 21q. Nature 359, 380-387.
Church, D.M., Stotler, C.J., Rutter, J.L., Murrell, J.R., Trofatter, J.A., Buckler, A.J., 1993. Isolation of genes from complex sources of mammalian genomic DNA using exon amplification. Nature Genet. 6, 98-105.
de Sario, A., Geigl, E., Palmieri, G., D'Urso, M., Bernardi, G., 1996. A compositional map of human chromosome band Xq28. Proc. Natl. Acad. Sci. USA 93, 1298-1302.
Duret, L., Mouchiroud, D., Gautier, C., 1995. Statistical analysis of vertebrate sequences reveals that long genes are scarce in GC-rich isochores. J. Mol. Evol. 40, 308-317.
Elgar, G., Rattray, F., Greystrong, J., Brenner, S., 1995. Genomic structure and nucleotide sequence of the p55 gene of the puffer fish Fugu rubripes. Genomics 27, 442-446.
Elgar, G., Sandford, R., Aparicio, S., Macrae, A., Venkatesh, B., Brenner, S., 1996. Small is beautiful: comparative genomics with the pufferfish (Fugu rubripes). Trends Genet. 12, 145-150.
Fickett, J.W., 1996. Finding genes by computer: the state of the art. Trends Genet. 12, 316-320.
Francke, U., 1994. Digitized and differentially shaded human chromosome ideograms for genomic applications. Cytogenet. Cell Genet. 65, 206-219.
Gardiner, K., 1996. Base composition and gene distribution: critical patterns in mammalian genome organization. Trends Genet. 12, 519-524.
Gardiner, K., Horisberger, M., Kraus, J., Tantravahi, U., Korenberg, J., Rao, V., Reddy, S., Patterson, D., 1990. Analysis of human chromosome 21: correlation of physical and cytogenetic maps; gene and CpG island distributions. EMBO J. 9, 25-34.
Gardiner, K., Ichikawa, H., Ohki, M., Patterson, D., Cheng, J.-F., 1995a. Localization of cDNAs to a region poorly represented in the CEPH chromosome 21 YAC contig: candidate genes for genetic diseases mapped to 21q22.3. Genomics 30, 376-379.
Gardiner, K., Graw, S., Ichikawa, H., Ohki, M., Joetham, A., Gervy, P., Chumakov, I., Patterson, D., 1995b. YAC analysis and minimal tiling path construction for chromosome 21q. Somat. Cell Mol. Genet. 21, 399-414.
Graw, S.L., Gardiner, K., Hall-Johnson, K., Hart, I., Joetham, A., Walton, K., Donaldson, D., Patterson, D., 1995. Molecular analysis and breakpoint definition of a set of human chromosome 21 somatic cell hybrids. Somat. Cell Mol. Genet. 21, 415-428.
Hattori, M., Tsukahara, F., Furuhata, Y., Tanahashi, H., Hirose, M., Saito, M., Tsukuni, S., Sakaki, Y., 1997. A novel method for making nested deletions and its application for sequencing of a 300 kb region of human APP locus. Nucleic Acids Res. 25, 1802-1808.
Ichikawa, H., Hosoda, F., Arai, Y., Shimizu, K., Ohira, M., Ohki, M., 1993. A NotI restriction map of the entire long arm of human chromosome 21. Nature Genet. 4, 361-365.
Koop, B.G., Nadeau, J.H., 1996. Pufferfish and a new paradigm for comparative genome analysis. Proc. Natl. Acad. Sci. USA 93, 1363-1365.
Korenberg, J.R., Chen, X.-N., Schipper, R., Sun, Z., Gonsky, R., Gerwehr, S., Carpenter, N., Daumer, C., Dignan, P., Disteche, C., Graham Jr., J.M., Hugdins, L., McGillivray, B., Miyazaki, K., Ogasawara, N., Park, J.P., Pagon, R., Pueschel, S., Sack, G., Say, B., Schuffenhauer, S., Soukup, S., Yamanaka, T., 1994. Down syndrome phenotypes: the consequences of chromosomal imbalance. Proc. Natl. Acad. Sci. USA 91, 4997-5001.
Korenberg, J., Rykowski, M.C., 1988. Human genome organization: Alu, lines, and the molecular structure of metaphase chromosome bands. Cell 53, 391-400.
Lee, C.C., Yazdani, A., Wehnert, M., Zhao, Z.Y., Lindsay, E.A., Bailey, J., Coolbauch, M.I., Couch, L., Xiong, M., Chinault, A.C., Baldini, A., Caskey, C.T., 1995. Isolation of chromosome-specific genes by reciprocal probing of arrayed cDNA and cosmid libraries. Hum. Mol. Genet. 4, 1373-1380.
Lovett, M., 1994. Fishing for complements: finding genes by direct selection. Trends Genet. 10, 352-357.
Mason, P.J., Stevens, D.J., Luzzatto, L., Brenner, S., Aparicio, S., 1995. Genomic structure and sequence of the Fugu rubripes glucose-6-phosphate dehydrogenase gene (G6PD). Genomics 26, 587-591.
Mural, R.J., 1995. Identifying candidate genes in genomic DNA: gene discovery and sequence annotation in the GRAIL-genQuest environment. In: Current Protocols in Human Genetics, John Wiley, New York, NY, pp. 6.5.1-6.5.22.
Nizetic, D., Gellen, L., Hamvas, R.M., Mott, R., Grigoriev, A., Vatcheva, R., Zehetner, G., Yaspo, M.L., Dutriaux, A., Lopes, C. et al., 1994. An integrated YAC-overlap and 'cosmid-pocket' map of the human chromosome 21. Hum. Mol. Genet. 3, 759-770.
Pekarsky, Y., Zabarovsky, E., Kashuba, V., Drabkin, H., Sandberg, A., Morgan, R., Rynditch, A., Gardiner, K., 1995. Cloning of breakpoints in 3q21 associated with hematological malignancy. Can. Genet. Cytogenet. 80, 1-8.
Pilia, G., Little, R., Aissani, B., Bernardi, G., Schlessinger, D., 1993. Isochores and CpG islands in YAC contigs in human Xq26.1-qter. Genomics 17, 456-462.
Porta, G., Zucchi, I., Hillier, L., Green, P., Nowotny, V., D'Urso, M., Schlessinger, D., 1993. Alu and L1 sequence distribution in Xq24-q28 and their comparative utility in YAC contig assembly and verification. Genomics 16, 417-423.
Riboldi, G., Nyakatura, G., Taudien, S., Coutelle, O., Perret, X., Elgar, G., Brenner, S., Platzer, M., Drescher, B., Rosenthal, A., 1996. Comparative Sequencing in Human and the Pufferfish FUGU: Analysis of 17 FUGU Genes Including Homologues of Human L1CAM, G6PD and P55 Located in Xq28. Genome Mapping and Sequencing. Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, NY, p. 205.
Schuler, G.D., Boguski, M.S., Stewart, E.A., Stein, L.D., Gyapay, G., Rice, K. et al., 1996. A gene map of the human genome. Science 274, 547-562.
Shimizu, N., Antonarakis, S.E., Van Broeckhoven, C., Patterson, D., Gardiner, K., Nizetic, D., Créau, N., Delabar, J-M., Korenberg, J., Reeves, R., Doering, J., Chakravati, A., Minoshima, S., Ritter, O., Cuticchia, J., 1995. Report of the Fifth International Workshop on Human Chromosome 21 Mapping 1994. Cytogenet. Cell Genet. 70, 147-182.
Soeda, E., Hou, D-X., Osoegawa, K., Atsuchi, Y., Yamagata, T., Shimokawa, T., Kishida, H., Soeda, E., Okano, S., Chumakov, I., Cohen, D., Raft, M., Gardiner, K., Graw, S.L., Patterson, D., De Jong, P., Ashworth, L.K., Slezak, T., Carrano, A.V., 1995. Cosmid assembly and anchoring to human chromosome 21. Genomics 25, 73-84.
Tanzi, R.E., Gusella, J.R., Watkins, P.C., Bruns, G.A.P., St. George-Hyslop, P., Van Keuren, M.L., Patterson, D., Pagan, S., Kurnit, D.M., Neve, R.L., 1987. Amyloid β protein gene: cDNA, mRNA distribution, and genetic linkage near the Alzheimer locus. Science 235, 880-884.
Tassone, F., Xu, H., Burkin, H., Weissman, S., Gardiner, K., 1995. cDNA selection from 10 Mb of chromosome 21 DNA: efficiency in transcriptional mapping and reflections of genome organization. Hum. Mol. Genet. 4, 1509-1518.
Xu, H., Wei, H., Tassone, F., Graw, S., Gardiner, K., Weissman, S.M., 1995. A search for genes from the dark band regions of human chromosome 21. Genomics 27, 1-8.
Yaspo, M.-L., Gellen, L., Mott, R., Korn, B., Nizetic, D., Poustka, A., Lehrach, H., 1995. Model for a transcript map of human chromosome 21: isolation of new coding sequences from exon and enriched cDNA libraries. Hum. Mol. Genet. 4, 1291-1304.