Gene 318 (2003) 137 – 147 www.elsevier.com/locate/gene
Mouse models of Down syndrome: how useful can they be? Comparison of the gene content of human chromosome 21 with orthologous mouse genomic regions
Katheleen Gardiner a,b,*, Andrew Fortna a, Lawrence Bechtel c, Muriel T. Davisson c
a Eleanor Roosevelt Institute at the University of Denver, 1899 Gaylord Street, Denver, CO 80206-1210, USA
b Department of Biochemistry and Molecular Genetics, University of Colorado Health Sciences Center, 4200 East 9th Avenue, Denver, CO 80206, USA
c The Jackson Laboratory, 600 Maine Street, Bar Harbor, ME 04609, USA
Received 22 April 2003; received in revised form 10 June 2003; accepted 20 June 2003 Received by G. Bernardi
Abstract
With an incidence of approximately 1 in 700 live births, Down syndrome (DS) remains the most common genetic cause of mental retardation. The phenotype is assumed to be due to overexpression of some number of the >300 genes encoded by human chromosome 21. Mouse models, in particular the chromosome 16 segmental trisomies, Ts65Dn and Ts1Cje, are indispensable for DS-related studies of gene-phenotype correlations. Here we compare the updated gene content of the finished sequence of human chromosome 21 (364 genes and putative genes) with the gene content of the homologous mouse genomic regions (291 genes and putative genes) obtained from annotation of the public sector C57Bl/6 draft sequence. Annotated genes fall into one of three classes. First, there are 170 highly conserved, human/mouse orthologues. Second, there are 83 minimally conserved, possible orthologues. Included among the conserved and minimally conserved genes are 31 antisense transcripts. Third, there are species-specific genes: 111 spliced human transcripts show no orthologues in the syntenic mouse regions, although 13 have homologous sequences elsewhere in the mouse genomic sequence, and 38 spliced mouse transcripts show no identifiable human orthologues. While these species-specific genes are largely based solely on spliced EST data, a majority can be verified in RNA expression experiments. In addition, preliminary data suggest that many human-specific transcripts may represent a novel class of primate-specific genes. Lastly, updated functional annotation of orthologous genes indicates that genes encoding components of several cellular pathways are dispersed throughout the orthologous mouse chromosomal regions and are not completely represented in the Down syndrome segmental mouse models. Together, these data point out the potential for existing mouse models to produce extraneous phenotypes and to fail to produce DS-relevant phenotypes. © 2003 Elsevier B.V. All rights reserved.
Keywords: Genomic sequence annotation; Sequence conservation; Ts65Dn; Spliced EST; Nonhuman primate; Species-specific
1. Introduction
With an incidence of one in 700 live births, Down syndrome (DS, trisomy 21) is the most common genetic cause of mental retardation (Hassold and Jacobs, 1984). Although the complete phenotype is complex, variable in
Abbreviations: DS, Down syndrome; EST, Expressed Sequence Tag; UTR, untranslated region; Mb, megabase; orf, open reading frame.
* Corresponding author. Eleanor Roosevelt Institute at the University of Denver, 1899 Gaylord Street, Denver, CO 80206-1210, USA. Tel.: +1-303-336-5652; fax: +1-303-333-8423. E-mail address: [email protected] (K. Gardiner).
0378-1119/$ - see front matter © 2003 Elsevier B.V. All rights reserved. doi:10.1016/S0378-1119(03)00769-8
severity, and may include immune deficiencies, heart defects, increased risk of leukemia, and early development of the pathology of Alzheimer’s disease, the common feature among DS individuals is the presence of mental retardation characterized by specific cognitive and behavioral deficits (Epstein, 1995; Pennington et al., 2003). DS is caused by an extra copy of the long arm of human chromosome 21, which, since completion of the finished sequence (Hattori et al., 2000), is known to span ~33.5 Mb of DNA and contain ~300 genes. As with other human genetic diseases, analysis of gene function and gene-phenotype correlations relies heavily on the use of mouse models. In DS, this is complicated not only by the large
K. Gardiner et al. / Gene 318 (2003) 137–147
number of candidate genes, but also by the fact that the mouse orthologues of chromosome 21 genes are distributed among three regions of the mouse genome: the distal ~23 Mb of Chr16, the centromeric ~1.1 Mb of Chr17 and an internal ~2.3 Mb of Chr10. Thus, creating a mouse model that completely recapitulates the genomic content of human DS is enormously challenging. The current mouse models are the segmental trisomies, Ts65Dn (Davisson and Costa, 1999; Akeson et al., 2001) and Ts1Cje (Sago et al., 1998, 2000), which are trisomic for the distal ~17 and ~8.3 Mb, respectively, of mouse chromosome 16. Both display relevant phenotypic features, including learning and behavioral deficits (Davisson and Costa, 1999).
Determining complete gene content and gene function are central issues in assessing how useful segmental trisomy mouse models are, or can be. The complete DS phenotype is presumed to develop from the increased activity of the products of the overexpressed chromosome 21 genes. Most obviously, gene activity includes the direct action of chromosome 21 encoded proteins on their substrates. More subtly, it includes their interactions with other proteins within complexes and pathways. Additional complexities arise when multiple chromosome 21 proteins mutually interact within a single pathway or complex. A mouse model that is not trisomic for all chromosome 21 homologues that function within a specific pathway or complex may not reproduce critical features of the phenotype. Lastly, the phenotypes of mouse segmental trisomies cannot be expected to completely recapitulate the DS phenotype if there are differences in the gene content of orthologous human and mouse genomic regions, i.e. if there are species-specific genes.
While comparative analyses of the mouse draft genomic sequence and the mouse transcriptome (Mouse Genome Sequencing Consortium, 2002; The Fantom Consortium, 2002), including analyses with special relevance to human chromosome 21 (Reymond et al., 2002a; Gitton et al., 2002), have been described recently, evaluating current mouse models of Down syndrome and making efficient decisions regarding the need for and nature of additional models requires a more detailed comparison of the complete gene content of human chromosome 21 with that of the orthologous mouse genomic regions. Here we describe results of reviewing the annotation of 27 Mb of mouse genomic draft sequence. Human and mouse genes and putative genes are divided into
three categories: genes that are highly conserved and likely orthologues, i.e. highly similar between the two organisms throughout essentially all coding exons; genes that are minimally conserved, i.e. only one or a few exons are similar, levels of similarity are low, and validated gene structures are available for only one organism; and putative genes that appear to be species-specific, i.e. no exons show any similarity at the DNA sequence level to the other organism within the homologous region. In each category, genes are further described by length of open reading frame (if any) and presence of functional domains. Of 145 genes that are protein coding with functional association, 140 are conserved. The second largest category of transcripts is species-specific, of which 111 have been identified in human and 38 in mouse. Lastly, the distribution of all classes of genes among the human chromosome 21 homologous mouse regions and within the trisomic segments of Ts65Dn and Ts1Cje is described. Together, these data define strengths and potential limitations of using mouse models to study gene-phenotype correlations in DS.
2. Material and methods
Table 1 lists the accession numbers of the contigs of finished human chromosome 21 sequence and draft public sector C57Bl/6J mouse sequence (NCBI mouse build 30, Feb. 10, 2003) that have been annotated. Mouse sequence contigs were identified by BLASTN searches with cDNA sequences of human chromosome 21 genes or, when available, their mouse homologues. Human and mouse contigs were divided into segments of 170 kb and annotated using the suite of programs available in the Genotator software (Harris, 1997) as modified (Fortna and Gardiner, 2001), comprising standard gene-finding tools and approaches. Briefly, each segment is annotated for base composition, CpG islands, coding exons (Grail.exp, Genscan, Mzef, Fgenes and Fgenesh), intra-species spliced EST models (EST-to-Genome; >95% identity within a single exon with matches over the entire EST sequence), BLASTN matches to intra-species unspliced ESTs (>95% identity for the entire EST), BLASTN inter-species matches to rat, mouse or human ESTs (>70% identity over >80 nucleotides), BLASTX matches to GenPept, BLASTN matches to transcripts of known cDNAs (RefSeq), and human/mouse conserved genomic sequences using PipMaker (Schwartz et al., 2000) (reporting matches >65% identity over >50 nucleotides). Additional human-only annotation includes conservation with nonhuman primate EST and genomic sequence (GenBank/NCBI) and data from chromosome 21 genomic oligonucleotide array screening with cDNA from tissue culture cells (Kapranov et al., 2002). Mouse-only annotation includes BLASTN matches to the Fantom cDNAs (2002). All annotations were hand-curated; gene models were added and corrected to incorporate experimental data. All graphical annotations and Supplemental Tables 1-3 (human gene list, mouse gene list and gene functional annotation, respectively) can be found at http://eri.uchsc.edu/chromosome21/index.html.

Table 1
Genomic contigs
Human contigs a    GenBank accession no.   Mouse contigs a                Ensembl accession no.    Chromosome 21 gene content   Mouse Chr; model region
002-1 to 004-2     AL163202-204            NA                             NA                       Cen-Pred5                    NA
005-1 to 027-1     AL163204-227            16_1-1 to 16_19-1              16.76000001-82200000     Pred5-NCAM2                  Chr16; Ts65Dn
027-2 to 056-2     AL163227-256            Ts65Dn_01-1 to Ts65Dn_26-2     16.82200001-90870000     >NCAM2-SOD1                  Chr16; + Ts65Dn, Ts1Cje
057-1 to 090-2     AL163257-290            Ts65Dn_26-2 to Ts65Dn_50-2     16.90870001-99184200     >SOD1-ZNF295                 + Ts65Dn, + Ts1Cje
090-2 to 097-1     AL163290-297            17_01-2 to 17_04-2             17.30170001-31310000     UMODL-KIAA0179               Chr17; NA
097-1 to 106       AL163297-306            10_09-2 to 10_02-2             10.78915001-76640000     PDXK-TEL                     Chr10; NA
a Human and mouse genomic sequences were divided into contigs of 340 kb; each contig was annotated in two segments of 170 kb.

Minimal criteria for annotating a gene were a spliced EST with at least two exons with consensus splice sites, or consistent prediction by at least three programs of at least three coding exons within 50 kb. Unspliced ESTs were not considered in the absence of other supporting evidence. Because standard analyses were used, it is not anticipated that more or different genes from those in the public databases were identified other than from possible novel dbEST entries. The value here is in the curation and comparison of gene contents.
Genes were classed as highly conserved if there was evidence of gene structures (consistent exon prediction and/or spliced ESTs) in both human and mouse genomic sequence and if the majority of coding exons showed >70% identity between species. Genes were classed as minimally conserved if evidence of gene structure was present in only one organism and/or few exons showed even >65% identity over >50 nucleotides. There was no ambiguity in assigning genes between the two categories. Genes were classed as putative and species-specific within the orthologous genomic regions if they failed to meet the latter criteria.
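The three conservation classes can be expressed as a small decision rule. The following Python sketch is illustrative only and is not the authors' pipeline: the `Exon` record, the function name `classify_gene` and the reduction of "few exons" to "at least one exon" are assumptions made for the example; only the numeric thresholds (>70% identity for a majority of coding exons; >65% identity over >50 nucleotides) come from the text.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Exon:
    """Hypothetical per-exon summary of a human/mouse comparison."""
    length: int            # exon length in nucleotides
    conserved_length: int  # length of the best cross-species match, 0 if none
    identity: float        # percent identity of that match, 0.0 if none


def classify_gene(coding_exons: List[Exon],
                  structure_in_human: bool,
                  structure_in_mouse: bool) -> str:
    """Assign a conservation class (simplified reading of the criteria)."""
    in_both = structure_in_human and structure_in_mouse
    highly = [e for e in coding_exons if e.identity > 70.0]
    # Highly conserved: gene structure in both species and a majority of
    # coding exons above 70% identity.
    if in_both and 2 * len(highly) > len(coding_exons):
        return "highly conserved"
    # Minimally conserved (assumption: one qualifying exon is enough):
    # some exon shows >65% identity over >50 nucleotides.
    if any(e.identity > 65.0 and e.conserved_length > 50 for e in coding_exons):
        return "minimally conserved"
    # Otherwise no exon meets the conservation threshold.
    return "species-specific"


if __name__ == "__main__":
    gene = [Exon(120, 120, 85.0), Exon(90, 90, 78.0), Exon(200, 0, 0.0)]
    print(classify_gene(gene, True, True))
```

In practice the assignment was hand-curated, so a rule of this kind would at most serve as a first pass before manual review.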
Genes and putative genes were classed as protein coding if the model contained an open reading frame of >40 amino acids. Functional associations of encoded proteins were assigned based on matches to proteins of known function or by domain identification by SMART (http://us.expasy.org/tools/).
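As an illustration of the >40 amino acid criterion, the sketch below scans the three forward reading frames of a transcript for the longest ATG-to-stop open reading frame. It is a minimal example under stated assumptions, not the procedure used for the annotation: the function names are hypothetical, only the given strand is examined (the strand of a spliced transcript is taken as known from its splice sites), and incomplete orfs lacking an initiator methionine are not handled.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_complete_orf(seq: str) -> int:
    """Length, in amino acids, of the longest ATG...stop orf on this strand."""
    seq = seq.upper()
    best = 0
    for frame in range(3):
        start = None  # index of the first ATG of the orf currently open
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                # Count codons from ATG up to, but not including, the stop.
                best = max(best, (i - start) // 3)
                start = None
    return best


def is_protein_coding(seq: str, min_aa: int = 40) -> bool:
    """Apply the >40 amino acid threshold described in the text."""
    return longest_complete_orf(seq) > min_aa


if __name__ == "__main__":
    transcript = "ATG" + "GCT" * 41 + "TAA"  # synthetic 42-codon orf
    print(longest_complete_orf(transcript), is_protein_coding(transcript))
```

A threshold this short will admit many chance orfs, which is consistent with the paper's treatment of such models as putative until verified experimentally.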
3. Results
As shown in Table 1, genomic sequence homologous to the 33.5 Mb of human chromosome 21 spans 23.2 Mb of mouse chromosome 16, 1.1 Mb of mouse chromosome 17 and 2.3 Mb of mouse chromosome 10. In the centromeric ~1.4 Mb of chromosome 21, no similarity to mouse chromosome 16 sequences could be detected. The number of genes and putative genes within each segment is listed in Table 2. The chromosome 16 segment has been further divided into three regions that are relevant to the segmental trisomy mouse models: the centromere proximal region that is absent in both models, the adjacent segment that is trisomic only in Ts65Dn, and the telomere proximal segment that is trisomic in both Ts65Dn and Ts1Cje. Ts65Dn contains chromosome 16 genes distal to Ncam2 (Akeson et al., 2001) and Ts1Cje, those distal to Sod1 (Sago et al., 1998). Because the mouse sequence is draft, there are numerous gaps; these are more frequent in the chromosome 17 and 10 segments, and are also notable at CpG islands. For comparison, we similarly annotated the Celera draft sequence, but no additional genes were found; on the contrary, a small number of genes were absent and CpG islands were less well represented.

Table 2
Distribution of human chromosome 21 genes and mouse genes in homologous genomic regions
                                   Highly conserved     Minimally conserved   Human-specific, putative   Mouse-specific, putative
Chromosomal region¹                P+F²   P−F³   RNA⁴   H⁵     M⁶             P+F    P−F    PP⁷    RNA    P+F    P−F    PP     RNA
16:Cen-Ncam2                       11     6      1      13     3              1      0      17     3      0      3      0      1
16:>Ncam2-Sod1*                    20     0      3      14     1              0      15     5      8      0      5      4      4
16:>Sod1-Znf295*,**                60     10     1      24     11             1      22     15     2      2      6      5      2
17:Umod-Kiaa0179                   16     3      0      2      1              0      1      0      0      0      1      0      1
10:Pdxk-Tel                        33     6      0      8      6              0      8      11     2      0      3      1      0
Total genes (H:364, M:291)         140    25     5      61     22             2      46     48     15     2      18     10     8
Ts65Dn⁸: 136/(49)/30               80     10     4      38     9              (0)    (37)   (20)   (10)   2      10     9      5
Ts1Cje⁸: 97/(21)/15                60     10     1      24     8              (1)    (22)   (15)   (2)    2      6      5      1
¹ Homologous mouse chromosome number, plus gene boundaries.
² Protein coding with functional association.
³ Protein coding, no domains or motifs. For highly conserved, likely orthologous, genes, complete gene structures at least through the coding regions have been verified for P+F and P−F genes in at least one organism or are evident from spliced EST data.
⁴ Putative functional RNA gene based on no open reading frame >40 amino acids.
⁵ Human.
⁶ Mouse.
⁷ Putative protein coding based on an open reading frame >40 amino acids, but gene structure cannot be assumed to be complete.
⁸ Gene content of Ts65Dn and Ts1Cje: number of genes conserved with human/(number of human-specific absent)/number of mouse-specific present.
*, Region trisomic in Ts65Dn; **, region trisomic in Ts1Cje.

3.1. Total gene numbers
The number of genes and putative genes identified on human chromosome 21 in the current review of annotations is 364, as compared to the 225 reported when the sequence was first published (Hattori et al., 2000). While a number of gene models were simply missed in the original annotation, the chief reason for the increase in number is the growth within the last 2 years in the complexity of dbEST, particularly of complete cDNA projects, the Orestes project, the NCI cancer anatomy project, and efforts to normalize and subtract new cDNA libraries and thus sequence predominantly novel but possibly rare cDNAs (Kikuno et al., 2002; Camargo et al., 2001; Bonaldo et al., 1996). The majority of the new gene models are based solely upon spliced ESTs and fall into the categories of minimally conserved and human-specific. Their inclusion is warranted at this time because (i) many encode open reading frames, (ii) when tested experimentally for expression, more than 90% can be verified, and (iii) preliminary data suggest many are conserved at least among primates (see below). By similar methods, 291 gene models were identified in the homologous mouse genomic regions. Complete gene lists are found in Supplementary Tables 1 and 2.
3.2. Highly conserved genes
3.2.1. Protein coding
One hundred sixty-four protein-coding genes are highly conserved between human and mouse by the criterion that the majority of coding exons are >70% identical and represent likely orthologues. Segments of conservation typically span an entire coding exon plus ~10-20 nucleotides on either side within adjacent introns. Essentially all genes in this class are identified by consistent coding exon prediction and spliced ESTs. Of the 164, 140 can be described with some functional association, by a functional domain or motif, or by nonrandom amino acid composition. These are listed in Supplementary Table 3.
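The category totals quoted in the abstract can be cross-checked against the per-region counts of Table 2. The short Python tally below is offered as a sketch: the column labels and the assignment of each run of five numbers to a column reflect one reading of the table and are assumptions, as are the variable names.

```python
# Per-region counts in the order: 16:Cen-Ncam2, 16:>Ncam2-Sod1,
# 16:>Sod1-Znf295, 17:Umod-Kiaa0179, 10:Pdxk-Tel.
# Column assignment is an assumed reading of Table 2.
COUNTS = {
    ("highly conserved", "protein+function"): [11, 20, 60, 16, 33],
    ("highly conserved", "protein only"): [6, 0, 10, 3, 6],
    ("highly conserved", "RNA"): [1, 3, 1, 0, 0],
    ("minimally conserved", "human"): [13, 14, 24, 2, 8],
    ("minimally conserved", "mouse"): [3, 1, 11, 1, 6],
    ("human-specific", "protein+function"): [1, 0, 1, 0, 0],
    ("human-specific", "protein only"): [0, 15, 22, 1, 8],
    ("human-specific", "putative protein"): [17, 5, 15, 0, 11],
    ("human-specific", "RNA"): [3, 8, 2, 0, 2],
    ("mouse-specific", "protein+function"): [0, 0, 2, 0, 0],
    ("mouse-specific", "protein only"): [3, 5, 6, 1, 3],
    ("mouse-specific", "putative protein"): [0, 4, 5, 0, 1],
    ("mouse-specific", "RNA"): [1, 4, 2, 1, 0],
}


def category_totals(counts):
    """Sum all regions and sub-columns within each gene category."""
    totals = {}
    for (category, _column), per_region in counts.items():
        totals[category] = totals.get(category, 0) + sum(per_region)
    return totals


TOTALS = category_totals(COUNTS)
SHARED = TOTALS["highly conserved"] + TOTALS["minimally conserved"]
HUMAN_TOTAL = SHARED + TOTALS["human-specific"]
MOUSE_TOTAL = SHARED + TOTALS["mouse-specific"]

if __name__ == "__main__":
    print(TOTALS)
    print("human:", HUMAN_TOTAL, "mouse:", MOUSE_TOTAL)
```

Under this reading, the conserved and minimally conserved genes are counted in both species, and each species-specific set is added only to its own total.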
3.2.2. Putative functional RNAs
The BIC gene (Tam, 2001), mapping to 21q21, has been proposed as a functional RNA. cDNAs isolated from chicken, human and mouse lack open reading frames, but share sequence similarity (between human and mouse) of 85% identity over 221 nucleotides through a region of one exon that is predicted to fold in an evolutionarily conserved structure. Published cDNAs for human and mouse comprised two exons; recent spliced ESTs for the human gene show an additional two exons at the 5′ end that lie within a CpG island. Functions of this RNA are unknown; the first BIC cDNA was isolated from chicken after activation by a nearby viral insertion. Five additional cDNAs identified in mouse libraries show similar characteristics to BIC: only one exon in each of the multi-exon transcripts shows any sequence similarity with human, but this similarity is long (~190 to ~300 nt) and high (74-87%), and encodes no open reading frame. These cDNAs have been provisionally classified as deriving from conserved, functional RNA genes (see Supplementary Table 4).
Gitton et al. (2002) and Reymond et al. (2002a) recently reported expression analysis of a large number of mouse orthologues of human chromosome 21 genes. Relative to these reports, our annotation shows several differences (Supplementary Table 4): (i) inclusion of 15 genes defined by ESTs from only one organism but where exon conservation is >75%; (ii) exclusion of six apparent pseudogenes, defined as intronless models where the best match in the second organism maps to a nonsyntenic region and contains introns; (iii) exclusion of 17 poorly supported models, i.e. those not meeting the minimal criteria described in the Methods; (iv) exclusion of 10 models shown to be parts of other genes; and (v) reclassification of three genes with limited conservation as minimally conserved genes. Such comparisons illustrate the need for ongoing curation coupled to experimental validation.
Table 3 Minimally conserved genes Human accession no.
Contig1
Orf characteristics2
Exon conservation3
#aa
Exons
Orf cons
1
2 546/0/0 244/0/0 215/0/0 204/35/87% 375/0/0
42/0/0 223/0/0 115/0/0 227/116/78%
BG221771 C21orf42 AA442273 BG221729 BI226939
7-2 38-2 41-1 49-2 56-2
>52 81 na >73 144
1 1, 2 na 1 1
pc nc na nc c
BF204217
61-1
64
3
pc
239/126/84% 524/0/0 66/142/79% 124/0/0 346/95/68% + 82/82% + 48/84% + 37/96% 70/0/0
BI087343 AJ003458 BG721537 BG206326
64-1 74-2 78-1 102-1
>56 na na 73
1 na na 2
pc na na pc
270/74/72% + 65/83% 134/75/65% 59/0/0 92/0/0
1
104/0/0
3
4
5
95/83/69% 123/55/82%
513/55/65% 166/64/77%
72/0/0
173/69/71% + 37/66% 121/0/0
371/0/0 79/0/0
229/0/0
570/67/72%
1 Location within chromosome 21. 2 Orf (open reading frame) characteristics: #aa, no. of amino acids in the orf (na, no orf >40 aa); Exons, exon no. where orf is located; Orf cons, level of conservation of the orf: c, completely conserved; pc, partially conserved; nc, not conserved; na, not applicable. 3 Exon length/conserved segment length/% identity.
Table 4
Characteristics of species-specific putative genes
               No orf          Complete orf                               Incomplete orfs
No. of exons   No. of genes    No. of genes   AA range     Average orf    No. of genes   AA range        Average orf
Human
1              0               2              144,148      146            2              247,297         >272
2              7               18             48-165       71             10             >50 to >113     >70
3              8               10             42-145       81             9              >40 to >139     >88
4              0               15             45-113       62             8              >42 to >210     >114
5              0               5              46-214       124            3              >89 to >100     >95
6              0               6              45-142       76             3              >129,>390       >236
7              0               0              na           na             0              na              na
8              0               0              na           na             0              na              na
9              0               0              na           na             1              >266            na
10             0               1              83           na             0              na              na
Mouse
1              0               2              131-167      149            0              na              na
2              5               5              61-131       90             6              >53 to >116     >65
3              2               6              46-88        65             2              96,105          >100
4              1               2              57           na             2              62,67           >63
5              0               1              57           na             0              na
6              0               1              121          na             0              na
7              0               0
8              0               0
9              0               0
10             0               1              794          na
na, not applicable.
3.3. Minimally conserved genes
Minimally conserved genes were identified as spliced ESTs or, rarely, consistent exon predictions present for only human or mouse, but not both, and for which only one or a few exons in a model show sequence conservation. In addition, while in some models the level of individual exon conservation is significant, for many it is <70% and occurs in segments that do not span the entire exon, a feature that contrasts with conserved protein-coding genes. The location and strand specificity of these ESTs suggest that generally they are unlikely to represent 5′UTR structures of adjacent, characterized genes. Lastly, where a complete open reading frame is present in the model, in some cases it does not overlap with the conserved segment. These may also be candidates for functional RNA genes. The structural features and conservation of 10 examples of minimally conserved genes illustrating these characteristics are shown in Table 3. Sixty-one minimally conserved genes were found in human and 22 in mouse. One of these, DSCR4, has been similarly reported by Toyoda et al. (2002). Again, the lower numbers in mouse likely reflect the lesser complexity of mouse dbEST. Data in Supplementary Table 5 show that these genes are dispersed throughout the chromosomal regions with no clustering. Translation (see Supplementary Table 5) shows that 16 human and four mouse genes have no open reading frame >40 amino acids, although these clearly may be incomplete at the 5′ ends. Of those encoding putative complete open reading frames, 25 human and 17 mouse proteins range from ~50 to >200 amino acids. The remaining genes in both species, 10 human and one mouse,
Table 5 Expression of human-specific genes Gene1
Tissues of expression4
Gene2
Tissues of expression4
PRED16 C21orf109 C21orf49 PRED4 PRED21 TCP10L C21orf20 C21orf21 C21orf22 C21orf54 AI636634 BI831702 BI833569 DSCR93 DSCR103
+ testes brain, lymphoblasts U (liver, heart) U (uterus, spleen, bone marrow, liver, heart) U (colon) + testes + testes U (colon) + testes + testes, muscle (brain, kidney) + brain, testes, kidney, muscle + brain, testes, lymphoblasts (liver) + liver, bone marrow (brain, testes, lymphoblast) + testes + testes
C21orf15 C21orf81 C21orf100 D21S090E D21S091E C21orf82 C21orf65/DSCR8 C21orf88 C21orf84 C21orf90 C21orf89 C21orf86 C21orf93 C21orf99 C21orf87 C21orf67/69
U U (liver, kidney) + prostate testes, spinal cord trachea testes U (breast) U (small intestine, breast, liver) U (heart) U (heart, muscle) U (liver, muscle) + 10 of 20 + thymus U + testes, prostate, brain, placenta + lung, placenta Negative 15 adult, 5 fetal
1 Gardiner et al., 2002; Gardiner and Fortna, unpublished data. 2 Reymond et al., 2001, 2002b. 3 Takamatsu et al., 2002. 4 U, essentially ubiquitous expression: RT-PCR and/or RACE positive in all or almost all of >10 tissues; tissues in brackets were negative; + tissues indicates tissue specificity; tissues with no designation indicates only tissues tested.
have open reading frames >40 amino acids, but because these are incomplete at the 5′ ends, they may actually be noncoding if no initiator methionine exists upstream in the transcript. No minimally conserved gene encodes identifiable motifs. Interpretation of the biological significance of these genes is difficult.
3.4. Species-specific spliced transcripts/putative genes
Species-specific spliced transcripts, classed as putative genes, are identified as multi-exon structures seen only in one organism and for which no exon shows identity >65% over >50 nucleotides within the orthologous genomic region. By these criteria, there are 111 human-specific and 38 mouse-specific putative genes. The 13 most centromere proximal human-specific transcripts, located distal to the first conservation with mouse chromosome 16, are multicopy in the human genome; they have been described as likely pseudogenes (Brun et al., in press) and are not discussed further here. For the remaining 98 human-specific and the mouse-specific transcripts, with few exceptions, BLASTN searches of human and mouse whole genome draft sequences do not reveal similarities elsewhere in either genome. The human-specific ABCC13 gene, mapping between the conserved genes RBM11 and STCH, encodes a novel ATP-binding cassette and shows no sequence similarity on mouse chromosome 16. The human-specific gene TCP10L, located within 21q22.1, encodes similarity to the T-complex responder locus. For this gene, there is no homologous protein match, exon prediction or spliced EST in the orthologous mouse chromosome 16 genomic region, although there are family members elsewhere in both genomes. Mouse-specific proteins with function include the previously reported novel integrin, ITGB2L (Pletcher et al., 2001), and a third carbonyl reductase (CBR). CBR1 and CBR3 had been identified on human chromosome 21, located on the same strand 70 kb apart.
In the homologous mouse segment, the novel CBR gene is located between CBR1 and CBR3, where a truncated, probable pseudogene for CBR is also found. The novel CBR gene has the same exon/intron organization as CBR1/3 and comprises a complete open reading frame. However, it may still be a pseudogene because it is not represented by any ESTs and it lacks association with a CpG island seen with both CBR1 and CBR3 in both human and mouse. Three human-specific complete cDNAs lacking any functional domains, C21orf49, DSCR9 and DSCR10, also have been experimentally verified (Gardiner et al., 2002; Takamatsu et al., 2002). The remaining species-specific spliced transcripts are based largely on spliced ESTs. As with minimally conserved
ESTs, the likelihood that these represent 5′UTRs (which are rarely conserved at the DNA sequence level) of characterized genes can be eliminated by considering orientation, location relative to CpG islands and predicted polyadenylation sites. Table 4 lists some structural and coding characteristics. Fifteen of the human-specific transcripts have no open reading frame and an average of two to three exons. Fifty-seven have complete open reading frames, averaging 60-90 amino acids and three to four exons. Eleven of these are >100 amino acids in length. Thirty-six have incomplete open reading frames averaging >80 to >115 amino acids and again comprise more than three exons. These are not merely transcribed repetitive sequences; fewer than 10% of the open reading frames contain even partial repeat sequences. For those currently lacking orfs or with incomplete orfs, experiments using 5′ RACE may well identify additional coding exons. However, eight of the human-specific spliced ESTs lacking orfs are composed of three exons, and 24 with incomplete orfs are composed of three to nine exons; it is unusual for 3′UTRs to be multi-exonic (Pesole et al., 2001). Mouse-specific transcripts show similar exon distributions but slightly shorter open reading frames.
Such large numbers of species-specific spliced transcripts were unexpected and their bona fides and biological significance are not clear. However, two sets of data suggest that further study is warranted. First, recent efforts to verify expression of a subset of these transcripts by RT-PCR and RACE have largely been successful: 30 of the 31 tested were shown to be transcribed and spliced in one or more tissues of a panel, and RACE has defined complete open reading frames (Table 5; Gardiner et al., 2002; Takamatsu et al., 2002; Reymond et al., 2001, 2002b; Gardiner and Fortna, unpublished data).
The majority show widespread expression, detectable by 40 cycles of PCR, which demonstrates that these are not unusually rare transcripts. Second, a search of the nonhuman primate genomic sequence database (NCBI) was conducted. This database consists largely of short, "sample" sequences less than 1 kb in length; each fragment is therefore expected to contain at most a single exon. The database was first searched with cDNA sequences for the 170 human genes that have orthologues in mouse. These genes comprise a total of approximately 1700 coding exons, of which 52 (~3%) were found in the nonhuman primate database, deriving from a total of 40 genes. In each case, consensus splice sites were seen and the match to human chromosome 21 was the best match in the human genome. A similar search of the database with the 98 nonconserved cDNAs, comprising approximately 250 exons, identified four exons deriving from
Fig. 1. Antisense gene structures. Twenty-three human (#s 1-23) and eight mouse (#s 24-31) antisense genes were identified. Sense gene symbols are indicated as in Supplementary Table 3; antisense genes are indicated by dbEST accession number or clone name. The genomic sequence is indicated by the solid black line; circles indicate CpG islands. Sense gene exons are numbered (larger genes are only partially represented); coding and UTR exons are shown by black and open boxes, respectively. In antisense genes, exons with complementarity to sense gene exons are shaded. Alternative splicing is indicated by lines joining appropriate exons.
the genes C21orf23, 24 and 27, and the EST AI275231. Added to these are nonconserved exons from C21orf9 and BI767541, and the DSCR9 and DSCR10 genes previously reported to be conserved in chimp but not in mouse (Takamatsu et al., 2002). Clearly the nonhuman primate genomic database is far from comprehensive. Nevertheless, sequences similar to those of exons of genes not conserved in mouse are being found and they demonstrate the characteristics of exons.
3.5. Antisense transcripts
Antisense transcripts were defined as multi-exon gene models where one or more exons were partially or entirely complementary to one or more exons of a sense gene. The presence of consensus splice sites defines the strands of the sense and antisense genes. Twenty-three antisense examples were identified in human and eight in mouse, with three genes, LSS, C21orf33/HES1 and SLC19A1, being associated with antisense transcripts in both organisms. Antisense transcripts were classed as conserved or minimally conserved. Structures of the sense and antisense genes are shown in Fig. 1. In three human pairs, SON and C21orf60, IFNGR2 and C21orf4, and COL18A1 and SLC19A1, the antisense transcription arises from novel alternative splicing within the 3′ end of a second known gene. Functional associations of each of the sense genes are given in Supplementary Table 3 and clearly indicate diversity. Antisense genes are most often identified by a single spliced EST, encode short open reading frames, and lack functional associations. Transcription and splicing of several antisense genes, AS-MCM3AP, AS-SON, AS-IFNGR, AS-C21orf60 and AS-KIAA0179, have been verified by RT-PCR and sequencing (data not shown).
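The antisense definition in Section 3.5 amounts to an exon-overlap test between two multi-exon models on opposite strands. A minimal Python sketch follows; the coordinate convention (1-based, inclusive), the function names and the example coordinates are assumptions for illustration, and the check for consensus splice sites that fixes each model's strand is taken as already done.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), 1-based inclusive genomic coordinates


def overlap_length(a: Interval, b: Interval) -> int:
    """Number of bases shared by two intervals (0 if they do not overlap)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def is_antisense(sense_exons: List[Interval], sense_strand: str,
                 candidate_exons: List[Interval], candidate_strand: str) -> bool:
    """True if the candidate is a multi-exon model on the opposite strand
    with at least one exon partially or entirely overlapping a sense exon."""
    if candidate_strand == sense_strand:
        return False
    if len(candidate_exons) < 2:  # the definition requires multi-exon models
        return False
    return any(overlap_length(s, c) > 0
               for s in sense_exons for c in candidate_exons)


if __name__ == "__main__":
    # Hypothetical coordinates, not taken from the paper.
    sense = [(1000, 1200), (3000, 3150), (5000, 5400)]
    candidate = [(5300, 5600), (7000, 7100)]
    print(is_antisense(sense, "+", candidate, "-"))
```

Overlap of an exon with an intron of the sense gene does not satisfy the definition, which is why only exon-to-exon comparisons are made.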
4. Discussion
We have compared the kinds and number of genes and putative genes present on human chromosome 21 with those present in the orthologous mouse genomic regions. This allows us to evaluate the strengths and weaknesses of current and possible mouse models of Down syndrome based on three criteria: the number and kinds of conserved genes that are not represented in the segmental trisomies, sets of functionally related conserved genes that are not completely represented in the segmental trisomies, and sets of species-specific spliced transcripts representing putative genes. Gene contents of the Ts65Dn and Ts1Cje mouse models are summarized in Table 2.
4.1. Representation of conserved genes in mouse models
One hundred seventy genes identified within human chromosome 21 and representing complete or near-complete cDNAs have homologues in the corresponding mouse genomic regions of chromosomes 16, 17 or 10. Only 94, fewer than 60%, of these are trisomic in Ts65Dn. Included among the many not represented in the mouse models are genes likely to function in cell cycle progression (BTG3, GANP/MCM3AP and PCNT2), a putative calcium channel (TRPC7), the Ca-binding protein (S100β), a protein with “epilepsy-associated repeats” (C21orf29/TSPEAR), a heat shock stress protein (STCH), a heat shock transcription factor binding protein (HSF2BP), the estrogen/glucocorticoid receptor interacting protein (RIP140) and a homeobox protein (PKNOX1). Overexpression of any of these (and others) can be hypothesized to be relevant to a neurological phenotype.
A total of 80 genes are annotated as minimally conserved and some may encode short open reading frames or may be candidates for functional RNAs. Membership in this category is provisional because experimental analysis to complete the gene structures in both organisms may identify additional coding exons that allow reassignment of a gene to the highly conserved category. Alternatively, experiments may fail to validate the gene in one organism; this would support reassignment as a species-specific gene.
4.2. Functionally related genes
Several sets of chromosome 21 and mouse genes that participate in common pathways or cellular processes have been identified. These genes and their chromosome location in mouse and in the segmental trisomies are shown in Fig. 2 (see also Supplementary Table 3). Seven genes have roles in RNA processing, functioning in constitutive splicing, editing and regulation of alternative splicing. Even if individual gene contributions are modest, the cumulative changes may have significant consequences for overall protein isoform patterns and therefore protein function. Notably, only two of these genes are trisomic in Ts65Dn/Ts1Cje. Four more are not even within mouse chromosome 16; one maps to chromosome 17 and three to chromosome 10. Thus, creating a mouse model for investigating Down syndrome-related alternative splicing defects is not straightforward. Similarly, sets of four genes involved in the proteasome pathway, six involved in carbon metabolism and methylation, and five with roles in mitochondrial function, each map to three different regions. Five genes that function in tight junctions also map to three different regions. In each of these cases, the contribution to the Down syndrome phenotype cannot be replicated in current mouse models. As more is learned about chromosome 21 gene functions, the involvement of additional pathways and complexes will likely be revealed.
4.3. Species-specific spliced transcripts/putative genes
The original annotation of the finished sequence of human chromosome 21 identified 225 genes and putative genes, many of which were based solely on spliced ESTs.
K. Gardiner et al. / Gene 318 (2003) 137–147
Fig. 2. Localizations of genes functioning in common pathways. Schematics of human chromosome 21 and orthologous mouse chromosomal regions are indicated. The mouse chromosome 16 region is further divided into three segments: absent in Ts65Dn; present in Ts65Dn but absent in Ts1Cje; and present in both Ts65Dn and Ts1Cje. Gene functions are given in Supplementary Table 3.
Several recent reports (Gardiner et al., 2002; Takamatsu et al., 2002; Reymond et al., 2001, 2002a,b; Gitton et al., 2002) have revised this gene list, deleting or combining some models and adding new models with verified expression patterns. Comparative analysis of human chromosome 21 with mouse genomic sequences shows that 23 of the original 225 and 17 of the recently published new chromosome 21 gene models have no sequence conservation of any exon within the mouse genome. The updated annotations presented here add a further 52 human-specific spliced ESTs and five coding exon prediction models. Similar annotation of homologous regions of the mouse genome identified 36 spliced ESTs composed of mouse-specific sequences. Several points are noteworthy. (i) Locations and strand specificities of these ESTs generally eliminate them as candidates for 5′ UTRs of adjacent characterized genes. (ii) Even though there are gaps in the mouse genomic sequence, which may contain orthologues of some of the putative human-specific sequences, it is unlikely that they are so large, so frequent and so precisely placed that they delete >90 genes without deleting even fragments of the >150 highly conserved genes. In addition, the human chromosome 21 genomic sequence is high-quality finished sequence (Hattori et al., 2000), arguing against gaps in the human sequence accounting for the mouse-specific genes. (iii) The species-specific transcribed sequences are unique within their respective genomes and are interspersed with conserved sequences that maintain the previously established human-mouse syntenies. Thus, these data do not represent pseudogenes and do not imply new regions of genomic conservation between human chromosome 21 and mouse chromosomes other than 16, 17 and 10. (iv) Human chromosome 21 also does not represent a unique gene organizational or evolutionary feature of the human genome; similar comparative analyses of 2 Mb regions of human chromosomes 10 and 13, surrounding the ADAR3 and Rb genes, respectively, identified nine and six spliced ESTs lacking conservation in mouse genomic sequence (data not shown). These densities are comparable to that seen on human chromosome 21 (~100 within <35 Mb). There are, however, regional variations; within 2 Mb surrounding the Utrophin gene on chromosome 6, only a single human-specific spliced EST was found, and in the very gene-rich region surrounding ADAR1 on chromosome 1, no human-specific spliced ESTs were seen.
It has been expected that the gene content of mammalian species will be essentially uniform and, if species-specific genes exist, they will represent a small percentage of the total, not the ~30% observed here (97 of 360 genes in human chromosome 21). The apparent species specificity of the sequences of these spliced ESTs thus raises questions as to their biological significance. The following points should be considered.
4.3.1. Annotation of species-specific genes
There is as yet no consensus on how much attention is to be given to potentially species-specific genes. Mural et al. (2002) in their comparison of the gene content of mouse chromosome 16 with orthologous human chromosomal regions failed to observe putative species-specific transcripts because their analysis did not include gene models supported only by a single criterion, such as spliced ESTs. The Mouse Genome Sequencing Consortium
(Waterston et al., 2002) made a similar decision in comparative analysis of the complete draft sequence of the mouse genome. In contrast, in analysis of the mouse transcriptome, the FANTOM Consortium and RIKEN groups (Okazaki et al., 2002) analyzed all mouse cDNAs available, without discussing conservation, and indeed noted that in one subset of 4280 mouse cDNAs (potential functional RNAs) only 454 identified similar sequences in the human genome. Similarly, Gitton et al. (2002) and Reymond et al. (2002a,b) did not exclude apparently human-specific genes from their annotations of chromosome 21, although they clearly could not be included in the mouse expression analyses. In a recent comparison of ~1.3 Mb of human chromosome 21 with homologous mouse chromosome 16, Toyoda et al. (2002) used the lack of conservation, coupled with relatively low levels of expression, to justify the elimination from the chromosome 21 gene list of two previously reported protein-coding genes, DSCR4 and DSCR8. It is indeed true that many species-specific spliced ESTs are largely represented by only one or a few dbEST entries; however, frequency in dbEST does not reliably indicate expression level. Indeed, when tested, the human-specific ESTs here are frequently seen to be expressed in diverse tissues, under RT-PCR conditions that do not suggest extraordinarily low levels of transcription.
4.3.2. Noise in the transcriptome?
Kapranov et al. (2002) recently used chromosome 21 and 22 genomic sequence oligonucleotide arrays to demonstrate that as much as 10-fold more genomic sequence appears in the cytoplasmic polyadenylated RNA of a number of tissue culture cell lines than is accounted for by exons of currently annotated genes. The novel transcribed sequences derive both from introns and from putative intergenic regions, and of those tested, 2/3 could be verified by RT-PCR and half by Northern analysis; these latter in particular do not suggest low levels of expression.
The function of these transcripts is not known; possibly they merely represent noise in the transcriptome, possibly even noise that is specific to cultured cells or cancer-derived cultured cells. It can be argued that the species-specific transcripts described here also represent noise, albeit noise that shows tissue specificities, and that is also spliced and most often encodes open reading frames (features that were not assessed in the data of Kapranov et al. (2002)). Even if noise, or originally noise, it cannot be excluded that evolution might subvert noise to practical purposes, at least in some cases.
4.3.3. Primate-specific genes?
A proportion of the genes currently annotated as human-specific may finally be classed as primate-specific. The occurrence of transcribed, spliced, primate-specific sequences suggests the need both for careful consideration of the definition of a gene and for further analysis of potential genes that meet only minimal requirements. Indeed, there are no data to suggest that gene models based solely on spliced ESTs are systematically artifactual; the goal of a complete gene catalogue would argue for being inclusive of potential gene models rather than arbitrarily exclusive. With the limitations of current information, it is premature to conclude that mammalian genomes cannot harbor a significant proportion of species-specific genes transcribed at modest levels. Of the 98 human-specific spliced transcripts unique to chromosome 21, 57 encode complete open reading frames >40 amino acids, and 11 of these are greater than 100 amino acids in length. While the nonhuman primate database is not yet comprehensive, already there is evidence for the presence of some of these genes. A search for translation products for at least a subset of these could be done with significant sensitivity using mass spectrometry and selective ion searches. Similar analyses would be appropriate in cases where RACE techniques demonstrate complete open reading frames in any of the additional 35 spliced transcripts currently lacking initiator methionines.
4.3.4. Functional RNAs
Genes for which 5′ RACE analysis fails to confirm open reading frames may indeed be functional RNA genes. For many known functional RNAs, sequence conservation is not detectable by BLASTN analysis because conservation is in short interspersed patterns. Conservation may, however, be detectable at the level of secondary structures. While this will not determine function, it will confirm the gene’s presence in a mouse model.
4.4. Summary
Down syndrome is a contiguous gene syndrome spanning ~33 Mb and >160 orthologous genes. Modeling Down syndrome in mouse has unique complexities due to the size of the region involved and the number of genes to be considered as candidates. Annotation of the gene content of mouse genomic regions orthologous to human chromosome 21 increases the complexity of the problem in two respects. First, all genes functioning in a common pathway or complex need to be trisomic in a single mouse model if the contribution of this pathway to the Down syndrome phenotype is to be assessed. Second, human-specific genes will be absent in any model, possibly masking DS-relevant phenotypic features, and mouse-specific genes will potentially add phenotypic features irrelevant to Down syndrome. The dispersion of genes in each of these classes throughout chromosome 21 and three mouse chromosomes shows that current mouse models potentially have significant limitations. Mouse, of course, remains the model organism of choice. However, it will be necessary to engineer additional models to complete pathway analyses and to understand the roles of species-specific genes if efficient correlations with human trisomy 21 are to be made.
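The first requirement in the summary, that every gene of a pathway be trisomic in a single model, can be checked mechanically once each gene is assigned to a mouse region. The sketch below is illustrative only. The gene names are placeholders rather than real annotation, and the region labels are invented. The counts mirror the RNA processing example of Section 4.2 (two genes trisomic in Ts65Dn/Ts1Cje, one on chromosome 17, three on chromosome 10); placing the seventh gene on chromosome 16 outside the trisomic segments is an assumption made by subtraction, not a statement from the data.

```python
from typing import Dict, List, Set, Tuple

# Placeholder gene -> mouse region assignments (hypothetical names and labels).
GENE_REGION: Dict[str, str] = {
    "geneA": "chr16_Ts65Dn_and_Ts1Cje",
    "geneB": "chr16_Ts65Dn_and_Ts1Cje",
    "geneC": "chr16_not_trisomic",   # assumed placement, see lead-in
    "geneD": "chr17",
    "geneE": "chr10",
    "geneF": "chr10",
    "geneG": "chr10",
}

# Regions that are trisomic in each segmental trisomy model.
MODEL_REGIONS: Dict[str, Set[str]] = {
    "Ts65Dn": {"chr16_Ts65Dn_only", "chr16_Ts65Dn_and_Ts1Cje"},
    "Ts1Cje": {"chr16_Ts65Dn_and_Ts1Cje"},
}


def pathway_coverage(pathway: List[str], model: str) -> Tuple[Set[str], Set[str]]:
    """Split a pathway's genes into those trisomic in the model and those missing."""
    trisomic_regions = MODEL_REGIONS[model]
    covered = {g for g in pathway if GENE_REGION[g] in trisomic_regions}
    missing = set(pathway) - covered
    return covered, missing


def is_fully_trisomic(pathway: List[str], model: str) -> bool:
    """True only if every gene of the pathway is trisomic in the given model."""
    _, missing = pathway_coverage(pathway, model)
    return not missing


RNA_PROCESSING = ["geneA", "geneB", "geneC", "geneD", "geneE", "geneF", "geneG"]

for model in MODEL_REGIONS:
    covered, missing = pathway_coverage(RNA_PROCESSING, model)
    print(model, "covered:", sorted(covered), "missing:", sorted(missing))
```

With these placeholder assignments neither model carries the complete set, which is the situation the summary describes for pathways dispersed across mouse chromosomes 16, 17 and 10.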
Acknowledgements
This work was supported by grants from the National Institutes of Health (HD17449), the Fondation Jerome Lejeune, the National Down Syndrome Society, and the Human Medical Genetics Program of the University of Colorado Health Sciences Center.
References
Akeson, E.C., Lambert, J.P., Narayanswami, S., Gardiner, K., Bechtel, L.J., Davisson, M.T., 2001. Ts65Dn-localization of the translocation breakpoint and trisomic gene content in a mouse model for Down syndrome. Cytogenet. Cell Genet. 93, 270–276.
Bonaldo, M.F., Lennon, G., Soares, M.B., 1996. Normalization and subtraction: two approaches to facilitate gene discovery. Genome Res. 6, 791–806.
Brun, M.E., Ruault, M., Ventura, M., Roizes, G., DeSario, A., 2003. Juxtacentromeric region of human chromosome 21: a boundary between centromeric heterochromatin and euchromatin chromosome arms. Gene 312, 41–50.
Camargo, A.A., Samaia, H.P., Dias-Neto, E., Simao, D.F., Migotto, I.A., Briones, M.R., Costa, F.F., Nagai, M.A., Verjovski-Almeida, S., Zago, M.A., et al., 2001. The contribution of 700,000 ORF sequence tags to the definition of the human transcriptome. Proc. Natl. Acad. Sci. U. S. A. 98, 12103–12108.
Davisson, M.T., Costa, A.C.S., 1999. Mouse models of Down syndrome. In: Popko, B. (Ed.), Mouse Models in the Study of Genetic Neurological Disorders. Advances in Neurochemistry, vol. 9. Plenum, New York, pp. 297–327.
Epstein, C.J., 1995. Down syndrome (trisomy 21). In: Scriver, C.R., et al. (Eds.), Metabolic and Molecular Bases of Inherited Disease. McGraw-Hill, New York, pp. 749–794.
Fortna, A., Gardiner, K., 2001. Genomic sequence analysis tools: a user’s guide. Trends Genet. 17, 158–164.
Gardiner, K., Slavov, D., Bechtel, L., Davisson, M., 2002. Annotation of human chromosome 21 for relevance to Down syndrome: gene structure and expression analysis. Genomics 79, 833–843.
Gitton, Y., Dahmane, N., Baik, S., Ruiz i Altaba, A., Neidhardt, L., Scholze, M., Herrmann, B.G., Kahlem, P., Benkahla, A., Schrinner, S., Yildirimman, R., Herwig, R., Lehrach, H., Yaspo, M.-L., HSA21 Expression Map Initiative, 2002. A gene expression map of human chromosome 21 orthologues in the mouse. Nature 420, 586–590.
Harris, N.L., 1997. Genotator: a workbench for sequence annotation. Genome Res. 7, 754–761.
Hassold, T.J., Jacobs, P.A., 1984. Trisomy in man. Annu. Rev. Genet. 18, 69–97.
Hattori, M., Fujiyama, A., Taylor, T.D., Watanabe, H., Yada, T., Park, H.-S., Toyoda, A., Ishii, K., Totoki, Y., et al., 2000. The sequence of human chromosome 21. Nature 405, 311–319.
Kapranov, P., Cawley, S.E., Drenkow, J., Bekiranov, S., Strausberg, R.L., Fodor, S.P.A., Gingeras, T.R., 2002. Large-scale transcriptional activity in chromosomes 21 and 22. Science 296, 916–919.
Kikuno, R., Nagase, T., Waki, M., Ohara, O., 2002. HUGE: a database for human large proteins identified in the Kazusa cDNA sequencing project. Nucleic Acids Res. 30, 166–168.
Mural, R.J., Adams, M.D., Myers, E.W., Smith, H.O., Miklos, G.L., Wides, R., Halpern, A., Li, P.W., Sutton, C.G., et al., 2002. A comparison of whole-genome shotgun-derived mouse chromosome 16 and the human genome. Science 296, 1661–1671.
Okazaki, Y., Furuno, M., Kasukawa, T., Adachi, J., Bono, H., Kondo, S., Nikaido, I., Osato, N., Saito, R., et al., Fantom Consortium, RIKEN Genome Exploration Research Group Phase I and II Team, 2002. Analysis of the mouse transcriptome based on functional annotation of 60,770 full-length cDNAs. Nature 420, 563–573.
Pennington, B.F., Moon, J., Edgin, J., Stedron, J., Nadel, L., 2003. The neuropsychology of Down syndrome: evidence for hippocampal dysfunction. Child Dev. 74, 75–93.
Pesole, G., Mignone, F., Gissi, C., Grillo, G., Licciulli, F., Liuni, S., 2001. Structural and functional features of eukaryotic mRNA untranslated regions. Gene 276, 73–81.
Pletcher, M.T., Wiltshire, T., Cabin, D.E., Villanueva, M., Reeves, R.H., 2001. Use of comparative physical and sequence mapping to annotate mouse chromosome 16 and human chromosome 21. Genomics 74, 45–54.
Reymond, A., Friedli, M., Henrichsen, C.N., Chapot, F., Deutsch, S., Ucla, C., Rossier, C., Lyle, R., Guipponi, M., Antonarakis, S.E., 2001. From PREDs and open reading frames to cDNA isolation: revisiting the human chromosome 21 transcription map. Genomics 78, 46–54.
Reymond, A., Marigo, V., Yaylaoglu, M.B., Leoni, A., Ucla, C., Scamuffa, N., Caccioppoli, C., Dermitzakis, E.T., Lyle, R., Banfi, S., Eichele, G., Antonarakis, S.E., Ballabio, A., 2002a. Human chromosome 21 gene expression atlas in the mouse. Nature 420, 582–586.
Reymond, A., Camargo, A.A., Deutsch, S., Stevenson, B.J., Parmigiani, R.B., Ucla, C., Bettoni, F., Rossier, C., Lyle, R., Guipponi, M., de Souza, S., Iseli, C., Jongeneel, C.V., Bucher, P., Simpson, A.J., Antonarakis, S.E., 2002b. Nineteen additional unpredicted transcripts from human chromosome 21. Genomics 79, 824–832.
Sago, H., Carlson, E.J., Smith, D.J., Kilbridge, J., Rubin, E.M., Mobley, W.C., Epstein, C.J., Huang, T.T., 1998. Ts1Cje, a partial trisomy 16 mouse model for Down syndrome, exhibits learning and behavioral abnormalities. Proc. Natl. Acad. Sci. U. S. A. 95, 6256–6261.
Sago, H., Carlson, E.J., Smith, D.J., Rubin, E.M., Crnic, L.S., Huang, T.-T., Epstein, C.J., 2000. Genetic dissection of region associated with behavioral abnormalities in mouse models for Down syndrome. Pediatr. Res. 48, 606–613.
Schwartz, S., Zhang, Z., Frazer, K.A., Smit, A., Riemer, C., Bouck, J., Gibbs, R., Hardison, R., Miller, W., 2000. PipMaker—a web server for aligning two genomic DNA sequences. Genome Res. 10, 577–586.
Takamatsu, K., Maekawa, K., Togashi, T., Choi, D.K., Suzuki, Y., Taylor, T.D., Toyoda, A., Sugano, S., Fujiyama, A., Hattori, M., et al., 2002. Identification of two novel primate-specific genes in DSCR. DNA Res. 9, 89–97.
Tam, W., 2001. Identification and characterization of human BIC, a gene on chromosome 21 that encodes a noncoding RNA. Gene 274, 157–167.
Toyoda, A., Noguchi, H., Taylor, T.D., Ito, T., Pletcher, M.T., Sakaki, Y., Reeves, R.H., Hattori, M., 2002. Comparative genomic sequence analysis of the human chromosome 21 Down syndrome critical region. Genome Res. 12, 1323–1332.
Waterston, R.H., Lindblad-Toh, K., Birney, E., Rogers, J., Abril, J.F., Agarwal, P., Agarwala, R., Ainscough, R., et al., Mouse Genome Sequencing Consortium, 2002. Initial sequencing and comparative analysis of the mouse genome. Nature 420, 520–562.