Article
doi:10.1006/geno.2001.6640, available online at http://www.idealibrary.com on IDEAL
From PREDs and Open Reading Frames to cDNA Isolation: Revisiting the Human Chromosome 21 Transcription Map
Alexandre Reymond,1 Marc Friedli,1,* Charlotte Neergaard Henrichsen,1,* Fabian Chapot,1,* Samuel Deutsch,1,2 Catherine Ucla,1 Colette Rossier,1 Robert Lyle,1 Michel Guipponi,1 and Stylianos E. Antonarakis1,†
1 Division of Medical Genetics, University of Geneva Medical School, 1211 Geneva, Switzerland
2 Graduate Program of Molecular and Cellular Biology, University of Geneva Medical School, 1211 Geneva, Switzerland
*These authors contributed equally to this work. †To whom correspondence and reprint requests should be addressed. Fax: 0041227025706. E-mail:
[email protected].
A supernumerary copy of human chromosome 21 (HC21) causes Down syndrome. To understand the molecular pathogenesis of Down syndrome, it is necessary to identify all HC21 genes. The first annotation of the sequence of 21q confirmed 127 genes and predicted an additional 98 previously unknown “anonymous” genes (predictions (PREDs) and open reading frames (C21orfs)), which were foreseen by exon prediction programs and/or spliced expressed sequence tags. These putative gene models still need to be confirmed as bona fide transcripts. Here we report the characterization and expression pattern of the putative transcripts C21orf7, C21orf11, C21orf15, C21orf18, C21orf19, C21orf22, C21orf42, C21orf50, C21orf51, C21orf57, and C21orf58, the GC-rich sequence DNA-binding factor candidate GCFC (also known as C21orf66), PRED12, PRED31, PRED34, PRED44, PRED54, and PRED56. Our analysis showed that most of the C21orfs originally defined by matching spliced expressed sequence tags were correctly predicted, whereas many of the PREDs, defined solely by computer prediction, do not correspond to genuine genes. Four of the six PREDs were incorrectly predicted: PRED44 and C21orf11 are portions of the same transcript, PRED31 is a pseudogene, and PRED54 and PRED56 were wrongly predicted. In contrast, PRED12 (now called C21orf68) and PRED34 (C21orf63) are now confirmed transcripts. We identified three new genes, C21orf67, C21orf69, and C21orf70, not previously predicted by any programs. This revision of the HC21 transcriptome has consequences for the entire genome regarding the quality of previous annotations and the total number of transcripts. It also provides new candidates for genes involved in Down syndrome and other genetic disorders that map to HC21.
Key Words: human chromosome 21, transcription map, genomic sequences annotation, gene prediction, C21orf, Down syndrome, trisomy 21
INTRODUCTION Trisomy 21 causes Down syndrome (DS), which is the main genetic cause of mental retardation. It affects approximately 1 live-born child in 700 [1]. To understand the molecular pathogenesis of DS, it is necessary to identify all of the human chromosome 21 (HC21) genes. The sequences of the human and other eukaryotes’ genomes and of expressed sequence tags (ESTs) facilitate this cataloging [2–8]. The initial annotation of the complete sequence of the long arm of HC21 showed 127 genes and 98 predicted transcripts. HC21 genes and predicted transcripts were identified on the basis
of identity/similarity to known proteins, identity to spliced ESTs, and/or consistent exon/intron prediction [5,9,10]. To evaluate the quality of the HC21 gene prediction, we characterized a set of open reading frames (C21orfs) defined by matching spliced ESTs and PREDs (predictions) defined solely by computer prediction ([5], HC21 gene catalog, category 4). We report here the human and mouse sequences of these gene models, together with detailed information about their expression patterns and their alternatively spliced transcripts. Our data show that thorough and complete annotation of HC21 will require a single-gene-based approach and cannot be based solely on exon prediction.
GENOMICS Vol. 78, Numbers 1-2, November 2001 Copyright © 2001 by Academic Press. All rights of reproduction in any form reserved. 0888-7543/01 $35.00
TABLE 1: C21orfs

| Gene^a | Aliases^a | Hs accession^b | Hs isoforms^d | Mm accession^c | Mm isoforms^d | Hs size [a.a.]^e | Mm size [a.a.]^e | Identity [%]^f | Homologies^g | C. elegans^h |
|---|---|---|---|---|---|---|---|---|---|---|
| C21orf7 | | AY033900 | | AY033899 | | 142 | 142 | 90 | TAB2-bdg of TAK1 | |
| C21orf11 | PRED44 | AJ409094/AF375989/AY040086 | + | AF360358 | | 235 (274) | 235 | 78 | 2-19 | |
| C21orf15 | | AY040090 | + | | | > 110 | | | | |
| C21orf18 | | NM_017438, AF391112 | + | AY037804 | | 440 | 439 | 77 | | Y92H12BR.6 |
| C21orf19 | | AF363446 | | AF363447 | | 297 | 297 | 99 | | C37C3.8 |
| C21orf22 | | AY040089 | | | | | | | | |
| C21orf42 | | AY035382-383 | + | | | 81 | | | | |
| C21orf50 | SON | AF380179-184 | + | | | 2426 | | | D111 and dsRBD | |
| C21orf51 | | AY033901 | | AY033902 | | 58 | 55 | 83 | | |
| C21orf57 | | AY040873-876 | + | | | > 299 | | | | |
| C21orf58 | | AY039243-244 | + | | | 239 (194) | | | | |
| C21orf63 | PRED34/SUE21 | AF358258/AY040087 | + | AF358257 | | 441 | 440 | 87 | SUEL | F32A7.3A |
| C21orf66 | | AY033903-906 | + | AY033907-908 | + | 917 | > 855 | 95 | GCF2 | F43G9.12 |
| C21orf67 | PRED54 | AF380178/AY040088 | + | | | 204 | | | | |
| C21orf68 | PRED12 | AK022689 | | AK014255 | | 273 | 273 | 92 | Layilin | |
| C21orf69 | PRED54 | AY035381 | | | | 163 | | | | |
| C21orf70 | PRED56 | AF391113-114 | + | AF391115 | | 215 (230) | 219 | 76 | | |
| HSPC230 | PRED31 | AF151064 | | AK006199 | | 240 | 240 | 80 | | |

a Nomenclature committee approved gene symbol and aliases. b Homo sapiens GenBank accession number(s). c Mus musculus GenBank accession number(s). d The (+) sign specifies the identification of multiple H. sapiens and/or M. musculus isoforms. e Size of the encoded peptides in number of amino acid residues. f Percentage of identity at the amino acid level between H. sapiens and M. musculus encoded peptides. g Identified homologies, see text for details. h Name of homologous gene in C. elegans.
RESULTS To identify candidate genes for DS phenotypes and to contribute to the update of the HC21 annotation, we are systematically sequencing HC21 ESTs and their mouse orthologues that correspond to uncharacterized genes. Here we report the isolation and characterization of a set of 18 predicted gene models first reported in the annotation of the complete sequence of HC21 [5]. To avoid any bias in our results, we randomly selected these gene candidates. We sequenced human and/or mouse ESTs matching C21orf7, C21orf11, C21orf15, C21orf18, C21orf19, C21orf22, C21orf42, C21orf50, C21orf51, C21orf57, and C21orf58, PRED12, PRED31, PRED34, PRED44, PRED54, and PRED56, and the GC-rich sequence DNA-binding factor candidate GCFC (also known as C21orf66) gene models. When the sequence of the mouse cDNA was first determined, the sequence of its human orthologue was inferred from the HC21 genomic
sequence and its existence was confirmed by RT-PCR (C21orf7, C21orf19, PRED34, and PRED56, described below). Most results (mapping position, number of exons, human gene size, accession numbers of human and mouse cDNA and of human genomic sequence, presence of alternatively spliced isoforms, human and mouse encoded peptide size, sequence comparison, homologies, and result of model refinement) are summarized in Tables 1 and 2 and are described briefly below. The human genomic structures and intron–exon junctions are available (http://www.medgen.unige.ch/). This work allowed identification of new coding single-nucleotide polymorphisms (cSNPs), which are added directly to the HC21 cSNP database (http://csnp.unige.ch/) [11]. C21orf7 Mouse mC21orf7 contains an ORF of 429 nucleotides and encodes a predicted protein of 142 residues (GenBank acc. no. AY033899) that is 90% identical to its human
TABLE 2: C21orfs mapping, gene structure, and model refinement

| Gene^a | Hs maps to | between^b | and^b | genomic seq^c | from [bp]^d | to [bp]^d | exons [n]^e | Hs gene size [kb]^e | Gene model refinement^f |
|---|---|---|---|---|---|---|---|---|---|
| C21orf7 | 21q22.11 | CCT8 | BACH1 | AL163249 | 149171 | 190718 | 4 | 42 | none |
| C21orf11 | 21q22.3 | BACE2 | MX2 | AL163285 | 125073 | 165679 | 9 | 40 | combines two models in one |
| C21orf15 | 21q11 | centromere | STCH | AL163204 | 206914 | 201692 | ≥6 | ≥6 | none |
| C21orf18 | 21q22.13 | RUNX1 | CBR1 | AP001724 | 296434 | 270591 | 11 | 25 | none |
| C21orf19 | 2 | | | NT_005194 | | | 8 | ? | unspliced pseudogene |
| C21orf22 | 21q22.3 | TMPRSS2 | ANKRD3 | AP001743 | 130076 | 131220 | ≥3 | >2 | none |
| C21orf42 | 21q21.3 | NCAM2 | VE-JAM | AP001693 | 177747 | 131873 | 4 | 46 | none |
| C21orf50 | 21q22.11 | GART | CRYZL1 | AP001717 | 152018 | 186435 | 15 | 35 | combines two models in one |
| C21orf51 | 21q22.12 | KCNE2 | KCNE1 | AP001719 | 310376 | 319646 | 3 | 10 | none |
| C21orf57 | 21q22.3 | MCM3 | PCNT | AP001759 | 304112 | 315450 | ≥5 | ≥12 | none |
| C21orf58 | 21q22.3 | MCM3 | PCNT | AP001759 | 335934 | 318321 | 8 | 18 | none |
| C21orf63 | 21q22.11 | KIAA0539 | TCP10L | AP001714 | 38796 | 141134 | 8 | 103 | none |
| C21orf66 | 21q22.11 | SYNJ1 | PRKCBP2 | AP001715 | 58613 | 21895 | 20 | 37 | larger gene |
| C21orf67 | 21q22.3 | ITGB2 | ADARB1 | AL163300 | 316517 | 311400 | 4 | 6 | wrong prediction/gene model position split in two |
| C21orf68 | 21q21.1 | BTG3 | PRSS7 | AP001672 | 196597 | 219119 | 6 | 23 | none |
| C21orf69 | 21q22.3 | ITGB2 | ADARB1 | AL163300 | 311645 | 309486 | 2 | 3 | wrong prediction/gene model position split in two |
| C21orf70 | 21q22.3 | ITGB2 | ADARB1 | NT_011515 | 1672688 | 1709627 | 6 | 37 | wrong prediction |
| HSPC230 | 6 | | | AL355586/AC016916 | | | 4 | 23 | pseudogene on HC21 |

a Committee approved gene symbol and aliases. b Gene names as specified in [5]. c H. sapiens genomic sequence GenBank accession numbers. d Gene start and end positions. e Number of exons and size in kilobases of the H. sapiens gene. f Gene model refinement, see text for details.
counterpart at the amino acid level (GenBank acc. no. AY033900). C21orf7 is highly homologous to the TAB2 binding domain of Xenopus laevis TAK1 (54% identity (74 of 138) and 67% similarity (93 of 138)). TAK1 is a member of the MAPKKK family and is activated by many cytokines, whereas TAB2 has been shown to be an adapter linking TAK1 and TRAF6 and mediating TAK1 activation in response to interleukin 1 [12–14]. The homology of C21orf7 to TAK1 indicates that the HC21 gene might encode a peptide involved in the interleukin 1 signaling pathway. C21orf11/PRED44 We identified three human alternatively spliced isoforms: C21orf11 form A, encoding a 274-residue predicted peptide (GenBank acc. no. AJ409094); C21orf11 form B, missing exon 2 (GenBank acc. no. AF375989); and C21orf11 form C,
missing exons 2 and 3 (GenBank acc. no. AY040086). The gene spans both the C21orf11- and PRED44-predicted gene models [5]; however, only two of the five PRED44-predicted exons are present in the C21orf11 cDNA. The human C21orf11 form B and mouse mC21orf11 contain ORFs of 708 nucleotides and encode predicted proteins of 235 residues (GenBank acc. nos. AF375989 and AF360358, respectively). C21orf11 is 78% identical to its mouse counterpart at the amino acid level. C21orf11 is a member of the functionally uncharacterized 2-19 protein family, a group of proteins present only in mammals. To determine where C21orf11 might exert its function, fusion proteins were expressed in Cos7 and U2OS cells. C21orf11 form A localizes in a discrete vesicular and perinuclear structure (Figs. 1A–1C, and data not shown). Control experiments with green fluorescent protein (GFP) empty vectors showed a
FIG. 1. Intracellular localization of 5′ EGFP-C21orf11 form A fusion protein. Fluorescence microscopy analysis is shown of fixed Cos7 cells transiently transfected with 5′ EGFP-C21orf11 expression vector. EGFP-C21orf11 fluorescence (A–D, green signal) was counterstained with 4′,6-diamidino-2-phenylindole (DAPI; C, blue signal) and anti--tubulin antibody (D, red signal).
mainly nuclear staining pattern, as previously published (data not shown). To better define the compartments identified by the C21orf11 protein, we studied colocalization using compartment-specific markers. The results show that the C21orf11 protein does not colocalize with the tested structures (microtubules, endoplasmic reticulum, and Golgi; Fig. 1D and data not shown). C21orf15 The partial human C21orf15 cDNA encodes a potential ORF of more than 110 residues at its carboxy end (GenBank acc. no. AY040090). Multiple attempts to clone the 3′ untranslated
region (UTR) of the gene by rapid amplification of cDNA ends (RACE) have failed. C21orf18 Mouse mC21orf18 contains an ORF of 1320 nucleotides and encodes a putative protein of 439 residues (GenBank acc. no. AY037804). During the preparation of this manuscript, its human counterpart was cloned and sequenced by the New Energy and Industrial Technology Development Organization (NEDO) human cDNA sequencing project (C21orf18 spliced form A, GenBank acc. no. NM_017438). The two encoded peptides are 77% identical at the amino acid level. We identified
TABLE 3: C21orfs expression^a
Transcripts tested: Hs C21orf7, Mm C21orf7, Hs C21orf11, Mm C21orf11, Hs C21orf15, Hs C21orf22, Hs C21orf42, Hs C21orf51, Mm C21orf51, Hs C21orf57, Hs C21orf58, Hs C21orf63, Mm C21orf63, Hs C21orf66, Mm C21orf66, Hs C21orf67, Hs C21orf68, Mm C21orf68, Hs C21orf69, Hs C21orf70, Hs PRED31, Mm PRED31.
Tissues and developmental stages: Brain, Heart, Kidney, Spleen, Liver, Lung, PBLs, Sm. intestine, Muscle, Testis, Placenta, Skin, Eyes, Bone marrow, Fetal liver, Fetal kidney, Fetal lung, Fetal heart, Fetal brain, Ovary, Colon, Thymus, Stomach, 8.5 day emb., 9.5 day emb., 12.5 day emb., 19 day emb.
[Expression matrix of + and (+) entries not reproduced.]
a Grayed areas specify untested tissues/developmental stages.
an alternatively spliced isoform in humans missing exon 4 (C21orf18 form B, GenBank acc. no. AF391112). C21orf19 Mouse mC21orf19 and human C21orf19 contain ORFs of 894 nucleotides and encode predicted proteins of 297 residues (GenBank acc. nos. AF363446 and AF363447). The human gene maps to chromosome 2 (HC2) (clone RP11-437L5, GenBank acc. no. AC022481, and clone RP11-431P19, GenBank acc. no.
AC010981) and contains eight exons, whereas an unspliced copy is present on HC21. The HC21 copy shows 12 mismatches and deletion of two closely positioned stretches of 4 and 2 nucleotides, respectively, over 1762 nucleotides compared with the HC2 C21orf19 cDNA. These changes do not create any in-frame termination codon, but the following observations support the hypothesis that the HC21 copy is a pseudogene. All 49 human ESTs deposited in dbEST (EST database) correspond to the HC2 C21orf19 copy.
Moreover, the mouse orthologue is identical to the HC2 copy in four amino acid residues that are different between the HC2 and the HC21 copies. Finally, the loss of two codons plus the resulting change of seven codons in the HC21 copy are not present in the mouse gene. We cannot, however, totally rule out the possibility that the HC21 copy corresponds to an extremely rare transcript, encoded by an intronless gene. The C21orf19-encoded peptide is a member of a highly conserved family of proteins containing the aminoacyl-transfer RNA synthetases class-II signature 2 (PROSITE). This group of proteins includes Drosophila melanogaster YC4P, Caenorhabditis elegans C37C3.8, Arabidopsis thaliana 2G2520, Dictyostelium discoideum 2034, Schizosaccharomyces pombe C4H3.04C, Saccharomyces cerevisiae YJR83.9, and Pyrococcus abyssi PAB0370. C21orf22 The partial human C21orf22 cDNA encodes a potential ORF at its 5′ end (GenBank acc. no. AY040089). Multiple attempts to clone the 5′ UTR of the gene by RACE have failed. C21orf42 We have identified two C21orf42 alternatively spliced forms encoding exons 1, 2, 4, and 5 (C21orf42 form A, GenBank acc. no. AY035382) and exons 1–5 (C21orf42 form B, GenBank acc. no. AY035383), respectively. Both contain an ORF of 246 nucleotides and encode a putative protein of 81 residues. The translocation t(7;21)(q21.2;q22), found in a patient with splenic marginal zone lymphoma, juxtaposes genomic DNA 66 kb upstream of the transcription start site of CDK6 (cyclin-dependent kinase 6) and the C21orf42 gene [15], probably causing dysregulation of the expression of the kinase. C21orf50/SON Full-length sequencing of mouse ESTs corresponding to C21orf50 showed that this predicted ORF and the published mouse orthologue of human SON/KIAA1019 cDNA overlap. We therefore reanalyzed all human ESTs and cDNAs corresponding to both the C21orf50 gene model and SON and confirmed our prediction by RT-PCR.
We determined that the longest SON ORF spans 7281 bp and encodes a predicted protein of 2426 residues, rather than the previously reported smaller predicted protein of 1929 amino acid residues [16–18]. We characterized six alternatively spliced isoforms, SON forms A–F, and revised the genomic structure of the gene (GenBank acc. no. AF380177–184). All intron–exon junctions conform to the donor–acceptor consensus sequence (http://www.medgen.unige.ch/). Exon 3 is remarkably large, with a size of 5916 bp. The SON protein contains both a D111 and a double-stranded RNA-binding domain (dsRBD) (residues 2311–2348 and 2371–2408, respectively, numbered according to SON form F, GenBank acc. no. AF380184). The dsRBD proteins (also known as double-stranded RNA-binding motif (DSRM) proteins) specifically recognize double-stranded RNA [19,20]. They are involved mainly in post-transcriptional gene
regulation by preventing protein expression or by mediating RNA localization. The D111 or G-patch domain is a short conserved region of about 40 amino acids occurring in putative RNA-binding proteins, including tumor suppressor and DNA damage-repair proteins [21,22]. C21orf51 We determined human and mouse C21orf51 by direct sequencing of full-length ESTs. They encode short predicted peptides of 58 and 55 residues (GenBank acc. nos. AY033901 and AY033902), which are 83% identical. C21orf57 The partial human C21orf57 cDNA encodes a potential ORF of > 299 residues at its 5′ end (C21orf57 form A, consisting of exons x to x + 4, GenBank acc. no. AY040873). We identified three alternatively spliced forms, either missing exon x + 2 (C21orf57 form B, GenBank acc. no. AY040874), or including two additional exons (y and y + 1, C21orf57 form C, GenBank acc. no. AY040875), or including two additional exons (y and y + 1) and missing exon x + 2 (C21orf57 form D, GenBank acc. no. AY040876). C21orf58 We have identified two human C21orf58 alternatively spliced forms with different exons 7 (C21orf58 form A and form B, GenBank acc. nos. AY039243 and AY039244). They encode predicted peptides of 239 and 194 residues, respectively. C21orf63/PRED34 We determined the mouse orthologue of PRED34 by direct sequencing; it contains an ORF of 1323 nucleotides and encodes a protein of 440 residues (mC21orf63, GenBank acc. no. AF358257). We inferred the sequence of its human orthologue from the HC21 genomic sequence and confirmed its existence by RT-PCR (C21orf63, GenBank acc. no. AF358258). We identified an alternatively spliced form that lacks exon 2 and encodes a truncated protein of only 70 residues (GenBank acc. no. AY040087). The nomenclature committee-approved gene symbol for this transcript is C21orf63 (human chromosome 21 open reading frame 63). Two of the three PRED34-predicted exons are present in the C21orf63 cDNA.
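The ORF lengths and peptide sizes quoted in this section are linked by simple codon arithmetic. The following is a minimal sketch, assuming that each quoted ORF length counts the terminal stop codon (an assumption on our part that agrees with the figures given here, for example 1323 nucleotides and 440 residues for mC21orf63). The helper name `residues_from_orf` is hypothetical and used for illustration only.

```python
def residues_from_orf(orf_nt: int) -> int:
    """Number of encoded residues for an ORF of orf_nt nucleotides.

    Assumption (illustrative, not stated in the text): the ORF length
    includes the stop codon, so one codon is subtracted.
    """
    if orf_nt < 6 or orf_nt % 3 != 0:
        raise ValueError("ORF length must be a multiple of 3 with at least one sense codon")
    return orf_nt // 3 - 1


# ORF lengths (nt) and peptide sizes (residues) as quoted in the text.
quoted = {
    "mC21orf7": (429, 142),
    "mC21orf18": (1320, 439),
    "C21orf19": (894, 297),
    "C21orf42": (246, 81),
    "mC21orf63": (1323, 440),
}

for name, (nt, aa) in quoted.items():
    status = "consistent" if residues_from_orf(nt) == aa else "check"
    print(f"{name}: {nt} nt, {aa} residues quoted, {status}")
```

A check of this kind is only a bookkeeping aid; it says nothing about whether the ORF itself is correctly delimited.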
C21orf63 encodes a protein with two D-galactoside/L-rhamnose binding SUEL domains (sea urchin egg lectin, alias galactose-binding lectin domain, Pfam PF02140). A homologous gene called F32A7.3A was identified in C. elegans, but RNA interference experiments did not demonstrate any phenotypes specifically related to the gene [23]. C21orf66/GCFC The National Center for Biotechnology Information Annotation Project identified a GC-rich sequence DNA-binding factor candidate (GCFC) that maps between SYNJ1 and PRKCBP2 on 21q22.11. Our analysis of this genomic region with the NIX suite of programs showed that this
putative gene was longer than predicted. The sequences of clones isolated from various human cDNA pools showed that C21orf66 spans 20 exons. All intron–exon junctions conform to the donor–acceptor splice-site rules (Table 2). The longest mRNA contains an ORF of 2745 nucleotides and encodes a predicted protein of 917 residues. The nomenclature committee-approved gene symbol for this transcript is C21orf66. We characterized four alternatively spliced isoforms: C21orf66 form A, encompassing exons 1–7, 9A, 10–16, and 18–20 (GenBank acc. no. AY033903); C21orf66 form B, including exons 1–7, 9A, and 10–17 (GenBank acc. no. AY033904); C21orf66 form C, containing exons 1–8 (GenBank acc. no. AY033905); and C21orf66 form D, encompassing exons 1–7, 9B, 10–16, and 18–20 (GenBank acc. no. AY033906). The D form is the most abundant in all the tissues tested by RT-PCR; this isoform is potentially bicistronic, encoding putative peptides of 511 and 411 residues, respectively. We isolated a partial mouse orthologue, mC21orf66, and characterized two alternatively spliced forms (mC21orf66 form A and mC21orf66 form D, GenBank acc. nos. AY033907 and AY033908); therefore, both the longest and the potentially bicistronic isoforms are conserved between rodents and primates. The gene is also conserved in C. elegans (F43G9.12). The C21orf66 protein is homologous to the GCF2 (GC-binding factor-2) transcription factor between positions 1–50, 179–283, and 549–779 (positions relative to isoform A). GCF2 represses transcription of the platelet-derived growth factor α chain gene [24]. Moreover, C21orf66 contains a nucleosome assembly protein domain from amino acid 83 to amino acid 557. These partial homologies to a transcription repressor and histone-interacting protein indicate that C21orf66 is involved in the regulation of transcription. C21orf67/PRED54 Human C21orf67 contains an ORF of 615 nucleotides and encodes a putative protein of 204 residues (GenBank acc. no. AF380178).
We identified an alternatively spliced form, which contains an alternative exon 3 and encodes a truncated protein of 82 residues (GenBank acc. no. AY040088). C21orf67 maps to the PRED54 locus [5], but none of the predicted exons of PRED54 are present in the C21orf67 cDNA sequence. C21orf68/PRED12 BLAST analysis showed that a human and a mouse sequence recently deposited by the NEDO human cDNA sequencing project and the Functional Annotation of Mouse (FANTOM) Consortium, respectively (GenBank acc. nos. AK022689 and AK014255), were homologous to the PRED12 gene model. The two encoded peptides are 92% identical at the amino acid level. The nomenclature committee-approved gene symbol for this gene is now C21orf68. All four predicted exons of PRED12 [5] are present in the C21orf68 cDNA. C21orf68 shows homology to hamster layilin, which is a membrane-binding site for talin in peripheral ruffles of spreading cells and a hyaluronan receptor [25,26]. The
similarity to C21orf68 extends over both the C-type lectin domain (64% identity (85 of 132 residues) and 79% similarity (104 of 132)) and the transmembrane region (33% identity (14 of 42) and 57% similarity (24 of 42)). These homologies indicate a potential involvement of C21orf68 in cell adhesion and motility. C21orf69/PRED54 Human C21orf69 contains an ORF of 492 nucleotides and encodes a predicted protein of 163 residues (GenBank acc. no. AY035381). C21orf69 maps to the wrongly predicted PRED54 locus [5], but its direction of transcription is opposite. C21orf70/PRED56 Mouse mC21orf70 and human C21orf70 form B are 76% identical at the amino acid level (GenBank acc. nos. AF391114 and AF391115). We identified an alternatively spliced isoform with a longer exon 2 (C21orf70 form A, GenBank acc. no. AF391113). Only a portion of one of the three PRED56 gene model predicted exons [5] is present in the C21orf70 cDNA sequence. PRED31 Is a Pseudogene Sequencing of the human ESTs corresponding to the PRED31 gene model [5] showed discrepancies with the HC21 genomic sequence. However, these ESTs match the recently deposited sequences of human and mouse full-length cDNA of HSPC230 (human full-length cDNA cloned from CD34+ stem cells program, Shanghai Institute of Hematology, GenBank acc. no. AF151064, and FANTOM Consortium, GenBank acc. nos. AK006199, AK007370, and AK008795), which maps to human chromosome 6 (genomic clones RP11-59I9 and RP11-687L3, GenBank acc. nos. AL355586 and AC016916) and spans four exons (Table 2). The two encoded peptides, each 240 residues long, are 80% identical at the amino acid level. A spliced pseudogene maps to 21q22.11 between TIAM1 and SOD1 (genomic sequence, GenBank acc. no. AP001711) and an unspliced pseudogene to HC6 (clones RP11-486M3 and RP11-795O6, GenBank acc. nos. AL357075 and AC068982).
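The identity and similarity percentages quoted for the C21orf68 and layilin comparison follow directly from the match counts given in parentheses. As a minimal sketch, assuming the percentages are the match counts divided by the aligned length and rounded to the nearest integer (our reading of the figures, not a description of the alignment software used), the function `percent` below (a hypothetical helper) reproduces them:

```python
def percent(matches: int, aligned: int) -> int:
    """Matching positions as a percentage of the aligned length, rounded to an integer."""
    if aligned <= 0:
        raise ValueError("aligned length must be positive")
    return round(100 * matches / aligned)


# Match counts as quoted in the text for C21orf68 versus hamster layilin.
comparisons = {
    "C-type lectin domain, identity": (85, 132),
    "C-type lectin domain, similarity": (104, 132),
    "transmembrane region, identity": (14, 42),
    "transmembrane region, similarity": (24, 42),
}

for label, (matches, aligned) in comparisons.items():
    print(f"{label}: {matches} of {aligned} = {percent(matches, aligned)}%")
```

The same arithmetic applies to the C21orf7 and TAK1 comparison (74 and 93 matches over 138 positions).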
Expression Pattern of Chromosome 21 PREDs and ORFs We used qualitative RT-PCR of panels of normalized human and mouse cDNA to study the expression profiles of the different C21orf genes. Although C21orf7, C21orf15, C21orf51, C21orf63, C21orf66, and PRED31 are ubiquitously expressed, C21orf11, C21orf22, C21orf42, C21orf57, C21orf58, and C21orf68 show restricted patterns of expression (Table 3). The inability to detect expression of C21orf67, C21orf69, and C21orf70 in our assays indicates that these transcripts are of very low abundance. Indeed, only four tags for each of the C21orf67 and C21orf69 genes can be identified in dbEST. All 12 C21orf70 human ESTs deposited in dbEST were isolated from tumor tissues, which suggests high expression of this gene in neoplasia and low expression in normal tissues. To assess the quality of our RT-PCR
results, we analyzed the expression pattern of a ubiquitously expressed gene (C21orf66) and a specifically expressed gene (C21orf11) by probing multiple-adult-tissue northern blots. As expected, C21orf66 is expressed ubiquitously; we detected four mRNA species of approximately 12, 9, 5, and 4 kb in all tissues tested (brain, heart, kidney, liver, colon, and spleen; data not shown). Similarly, we also confirmed the restricted pattern of expression of C21orf11; we detected a single mRNA species of 1 kb in kidney and colon (data not shown).
DISCUSSION The initial annotation of the complete sequence of 21q showed 127 genes and 98 predicted transcripts identified by gene prediction programs and/or spliced ESTs [5,9,10]. To refine and update HC21 sequence annotation and evaluate the quality of the prediction of gene models, we characterized a set of C21orfs and PREDs. The C21orfs were previously unknown anonymous genes defined by matching spliced ESTs ([5], subcategories 4.1 and 4.2 of gene models), whereas PREDs were defined solely by computer prediction with no matching spliced ESTs ([5], subcategory 4.3). This study allowed us to update and correct previous gene models by combining two predicted genes in one (for example, C21orf11 and PRED44), by showing that the predicted gene does not correspond to a transcript as another gene is present at this locus (for example, PRED54 and PRED56), or by showing that the predicted gene corresponds to a pseudogene (for example, PRED31), whereas the analogous gene maps elsewhere on the human genome (Table 2). Our study confirms the intuitive prediction that gene models from subcategories 4.1 and 4.2 have stronger coding potential than those of subcategory 4.3. It also emphasizes the limits of in silico-only gene prediction, as four of the six characterized PREDs (PRED31, PRED44, PRED54, and PRED56) proved to be wrongly predicted. Similarly, another recently published study pertinent to HC21 annotation found that the only PRED gene model (PRED43) mapping to the interval studied (21q22.13–21q22.2) by human and mouse comparative mapping [27] was not predicted correctly. Likewise, during annotation of HC22, it was noted from the beginning that almost 40% of the GENSCAN-predicted genes did not form part of any gene confirmed by other means and included an unknown proportion of false positives. This observation led the authors to estimate that only 100 of the 325 HC22-predicted gene models would prove to be genuine genes [6].
Conversely, we expect that as-yet-undescribed HC21 genes will be characterized. Indeed, sequencing of the HC22 showed that 6% of annotated genes were not detected and 16% of all known exons were entirely missed by prediction programs [6]. Consistently, analysis of the HC22 gene coverage with ORESTES (ORF ESTs) identified 219 contigs that matched originally unannotated regions from HC22 [28]. Of these, 171 matched an available dbEST sequence not used in the initial
HC22 annotation. Likewise, we identified three new HC21 genes, C21orf67, C21orf69, and C21orf70, not predicted by computer programs at the PRED54 and PRED56 loci. These genes must be added to the three recently described PREDs (PRED67, PRED68, and PRED69) found by comparative mapping [27]. In addition to the effort described here to refine HC21 sequence annotation, we are re-analyzing the entire HC21 sequence incorporating new ESTs (including the ORESTES database [28]). The resulting revised HC21 transcriptome will have consequences for the rest of the genome regarding the quality of previous annotations and the total number of transcripts. It will also provide new candidates for genes involved in DS and other genetic disorders that map to HC21.
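The counts behind the statement that four of the six characterized PREDs were not confirmed can be tallied explicitly. The sketch below is illustrative only: the outcome labels are our shorthand for the results described in the text, and the dictionary name `pred_outcomes` is hypothetical.

```python
# Outcome of each characterized PRED, as described in the Results.
# The labels are our shorthand, not the paper's terminology.
pred_outcomes = {
    "PRED12": "confirmed",             # now C21orf68
    "PRED34": "confirmed",             # now C21orf63
    "PRED31": "pseudogene",            # functional gene HSPC230 maps to HC6
    "PRED44": "merged with C21orf11",  # portion of the same transcript
    "PRED54": "wrong prediction",      # C21orf67 and C21orf69 at this locus
    "PRED56": "wrong prediction",      # C21orf70 at this locus
}

not_confirmed = sorted(p for p, o in pred_outcomes.items() if o != "confirmed")
fraction_wrong = len(not_confirmed) / len(pred_outcomes)

# Fraction of HC22 predicted gene models estimated to be genuine [6].
hc22_genuine = 100 / 325

print(f"PREDs not confirmed: {len(not_confirmed)} of {len(pred_outcomes)} ({fraction_wrong:.0%})")
print(f"HC22 models estimated genuine: {hc22_genuine:.0%}")
```

With only six PREDs characterized, the resulting proportion is a rough indication rather than an estimate for all predicted gene models.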
MATERIALS AND METHODS cDNA cloning. ESTs were obtained from Research Genetics (http://www.resgen.com) except where otherwise indicated, and were sequenced directly. The EST identities were as follows: C21orf7, mouse IMAGE clone 1153184; C21orf11 (PRED44), human and mouse IMAGE clones 741869, 1546160, and 1672666, the human GKCDWC07 clone was provided by Z. Han and the Chinese National Human Genome Center at Shanghai (CHGC), and the 5′ and the 3′ portions of the transcript were isolated through RACE experiments from kidney cDNA; C21orf15, human IMAGE clones 153429, 1473914, 1654356, and 2630302, the 5′ and the 3′ extremities of the transcript were isolated through RACE experiments from spleen cDNA; C21orf18, mouse IMAGE clone 717391; C21orf19, mouse IMAGE clones 976466, 1512004, 3468606, and 3708762; the mouse H3015D11 clone was part of the National Institute of Aging (NIA) 15K Mouse cDNA Clone Set (http://lgsun.grc.nia.nih.gov/cDNA/15k.html), freely distributed to the scientific community; C21orf22, human IMAGE clones 730871 and 1638304; C21orf42, human IMAGE clone 757422; C21orf50, mouse IMAGE clones 532600 and 1053712. For C21orf50 analysis, the mouse H3094B06 clone was part of the National Institute of Aging (NIA) 15K Mouse cDNA Clone Set. As the mouse EST sequences overlapped with mouse Son, we determined a potential human C21orf50/SON cDNA contig in silico. In this updated transcript, C21orf50 spans exons 1 and 2 and the first 98 bp of exon 3, and SON spans the last 4508 bp of exon 3 and the remaining exons (http://www.medgen.unige.ch/). To confirm that C21orf50 and SON were indeed portions of the same gene, we used RT-PCR from three different brain cDNA pools with oligonucleotides 9059 (5′-CTGCCAGGAGTCTACCAAATGA-3′) and 9060 (5′-CTCTACTGCTGCTGTCACCAAA-3′), mapping to exon 2 and the 3′ region of exon 3, respectively. The size and sequence of this amplicon confirmed the new SON organization. C21orf51: mouse IMAGE clone 3152735.
C21orf57: human IMAGE clones 1032047 and 2296208. C21orf58: human IMAGE clones 1416249 and 2586582. C21orf63 (SUE21/PRED34): mouse IMAP clone UI-MAO0-aby-g-12-0-UI. For C21orf66 (GCFC), the 3⬘ portion of the transcript was isolated through RT-PCR and RACE experiments from various cDNAs (brain, placenta, spleen, heart, small intestine, lung, testis, and fetal lung), whereas the 5⬘ portion was retrieved through screening a fetal brain cDNA library (Clontech, Palo Alto, CA). The mouse orthologue was cloned by screening of a mouse testis cDNA library (Clontech) and subsequent 5⬘-RACE experiments. C21orf67 (PRED54): human IMAGE clone 3134234. C21orf68 (PRED12): clone NT2RM4001813, from the NEDO human cDNA sequencing projects (Japan). C21orf69 (PRED54): human IMAGE clone 1705552. C21orf70 (PRED56): mouse IMAGE clone 3515288. Rapid amplification of cDNA ends. Both 5⬘- and 3⬘-RACE were done on various poly(A)+ RNAs using the Marathon cDNA Amplification Kit (Clontech). Double-stranded cDNA synthesis and adaptor ligations to the synthesized cDNA were done according to the manufacturers’ instructions. PCR products were purified and sequenced directly. Genomic structure. After identification of the transcript, the intron–exon junctions were determined by comparison of the genomic sequence to the cDNA
sequence using the est_genome software, available through the UK Human Genome Mapping Project (http://www.hgmp.mrc.ac.uk).

Expression pattern studies. RT-PCR was done on panels of human and mouse cDNA, prepared as described [29]. The panels consisted of 20 human and 16 mouse pooled samples originating from different tissues and developmental stages. The sequences of the primers used in this analysis are available on request. Human multiple-tissue northern blots (Origene, Rockville, MD) were hybridized with partial C21orf11 ORF or C21orf66 ORF cDNA following the manufacturer’s recommendations. Sample loading was assessed using an actin probe.

Subcellular localization. U2OS and Cos7 cells were transfected with 5′-EGFP-C21orf11 (EGFP, enhanced GFP) or 5′-HA-C21orf11 (HA, hemagglutinin) expression vectors using Lipofectamine (Life Technologies, Gaithersburg, MD) as described before [30]. At 24 h after transfection, EGFP fluorescence was detected on cells fixed with 4% paraformaldehyde. Images were analyzed with Adobe Photoshop software. The anti-Giantin (324450; Calbiochem, La Jolla, CA) and ER-Tracker (Molecular Probes, Eugene, OR) compartment-specific markers and anti-α, anti-, and anti-α tubulin (Sigma, St. Louis, MO) were used in the co-localization experiments. pEGFP and pHA-CDNA3 vectors without inserts were expressed in cells as controls.
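The intron–exon junctions described under Genomic structure were determined with est_genome, which the following Python sketch does not reproduce. It illustrates only a plausible follow-up step: given exon coordinates of the kind a spliced alignment reports, derive the introns and check them for the canonical GT...AG dinucleotides. The toy sequence, the coordinates, and all function names are invented for illustration and are not taken from this study.

```python
# Minimal sketch: derive introns from exon coordinates and check GT...AG.
# All sequences and names here are hypothetical illustrations.

def introns_from_exons(exons):
    """Return the gaps between consecutive exons.

    `exons` is a list of (start, end) genomic coordinates,
    0-based and half-open, sorted by position.
    """
    return [(prev_end, next_start)
            for (_, prev_end), (next_start, _) in zip(exons, exons[1:])]


def is_canonical_intron(genomic, intron):
    """True if the intron starts with GT and ends with AG."""
    start, end = intron
    seq = genomic[start:end].upper()
    return len(seq) >= 4 and seq.startswith("GT") and seq.endswith("AG")


def spliced_sequence(genomic, exons):
    """Concatenate the exon sequences, i.e. the expected cDNA."""
    return "".join(genomic[start:end] for start, end in exons)


# Build a toy gene from labelled parts so the coordinates are computed,
# not typed by hand.
parts = [
    ("exon", "ATGGCC"),
    ("intron", "GTAAGTCCCTTTCAG"),
    ("exon", "TTGACC"),
    ("intron", "GTGAGTAAACCCTAG"),
    ("exon", "GGATAA"),
]

genomic = ""
exons = []
for kind, seq in parts:
    if kind == "exon":
        exons.append((len(genomic), len(genomic) + len(seq)))
    genomic += seq

introns = introns_from_exons(exons)
canonical = [is_canonical_intron(genomic, intron) for intron in introns]
cdna = spliced_sequence(genomic, exons)

for intron, ok in zip(introns, canonical):
    print(intron, "canonical" if ok else "non-canonical")
```

Coordinates are kept 0-based and half-open so that an intron is exactly the slice between the end of one exon and the start of the next. Real alignment output typically uses 1-based inclusive coordinates and would need converting first.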
ACKNOWLEDGEMENTS

We thank Catia Attanasio, John-Louis Blouin, Olivier Menzel, Marguerite Neerman-Arbez, and Marie Wattenhofer, University of Geneva, for their suggestions and/or critical reading of the manuscript. We thank Joelle Michaud, Marie-Pierre Papasavvas, and Natalie Scamuffa, University of Geneva, for core assistance, and Anne Simon for preparation of the manuscript. This work was supported by grants from the Jérôme Lejeune foundation to R.L., A.R., and M.G., and from the Swiss FNRS 31.57149.99, the Swiss FNRS NPR38, the European Union/OFES, and the ChildCare foundation to S.E.A.

RECEIVED FOR PUBLICATION JULY 5; ACCEPTED SEPTEMBER 7, 2001.
REFERENCES

1. Epstein, C. J. (1995). Down syndrome (trisomy 21). In The Metabolic and Molecular Bases of Inherited Disease (C. R. Scriver, A. L. Beaudet, W. S. Sly, and D. Valle, Eds.), pp. 749–794. McGraw Hill, New York, NY.
2. Boguski, M. S., Lowe, T. M., and Tolstoshev, C. M. (1993). dbEST—database for “expressed sequence tags”. Nat. Genet. 4: 332–333.
3. Goffeau, A., et al. (1996). Life with 6000 genes. Science 274: 546.
4. The C. elegans Sequencing Consortium. (1998). Genome sequence of the nematode C. elegans: a platform for investigating biology. Science 282: 2012–2018.
5. Hattori, M., et al. (2000). The DNA sequence of human chromosome 21. The chromosome 21 mapping and sequencing consortium. Nature 405: 311–319.
6. Dunham, I., et al. (1999). The DNA sequence of human chromosome 22. Nature 402: 489–495.
7. Venter, J. C., et al. (2001). The sequence of the human genome. Science 291: 1304–1351.
8. Lander, E. S., et al. (2001). Initial sequencing and analysis of the human genome. Nature 409: 860–921.
9. Antonarakis, S. E. (2001). Chromosome 21: from sequence to applications. Curr. Opin. Genet. Dev. 11: 241–246.
10. Gardiner, K., and Davisson, M. T. (2000). The sequence of human chromosome 21 and implications for research into Down syndrome. Genome Biol. 1: 1–9.
11. Deutsch, S., Iseli, C., Bucher, P., Antonarakis, S. E., and Scott, H. S. (2001). A cSNP map and database for human chromosome 21. Genome Res. 11: 300–307.
12. Shibuya, H., et al. (1996). TAB1: an activator of the TAK1 MAPKKK in TGF-β signal transduction. Science 272: 1179–1182.
13. Ninomiya-Tsuji, J., et al. (1999). The kinase TAK1 can activate the NIK-IκB as well as the MAP kinase cascade in the IL-1 signalling pathway. Nature 398: 252–256.
14. Takaesu, G., et al. (2000). TAB2, a novel adaptor protein, mediates activation of TAK1 MAPKKK by linking TAK1 to TRAF6 in the IL-1 signal transduction pathway. Mol. Cell 5: 649–658.
15. Corcoran, M. M., et al. (1999). Dysregulation of cyclin dependent kinase 6 expression in splenic marginal zone lymphoma through chromosome 7q translocations. Oncogene 18: 6271–6277.
16. Bliskovski, V. V., Kirillov, A. V., Zakhar’ev, V. M., and Chumankov, I. M. (1992). Gen son cheloveka: bol’sho i maly transkripty soderzhat raznye 5′-kontsevye posledovatel’nosti. Mol. Biol. (Mosk) 26: 807–812.
17. Mattioni, T., et al. (1992). A cDNA clone for a novel nuclear protein with DNA binding activity. Chromosoma 101: 618–624.
18. Kikuno, R., et al. (1999). Prediction of the coding sequences of unidentified human genes. XIV. The complete sequences of 100 new cDNA clones from brain which code for large proteins in vitro. DNA Res. 6: 197–205.
19. Burd, C. G., and Dreyfuss, G. (1994). Conserved structures and diversity of functions of RNA-binding proteins. Science 265: 615–621.
20. Ryter, J. M., and Schultz, S. C. (1998). Molecular basis of double-stranded RNA-protein interactions: structure of a dsRNA-binding domain complexed with dsRNA. EMBO J. 17: 7505–7513.
21. Drabkin, H. A., et al. (1999). DEF-3(g16/NY-LU-12), an RNA binding protein from the 3p21.3 homozygous deletion region in SCLC. Oncogene 18: 2589–2597.
22. Aravind, L., and Koonin, E. V. (1999). G-patch: a new conserved domain in eukaryotic RNA-processing proteins and type D retroviral polyproteins. Trends Biochem. Sci. 24: 342–344.
23. Fraser, A. G., et al. (2000). Functional genomic analysis of C. elegans chromosome I by systematic RNA interference. Nature 408: 325–330.
24. Khachigian, L. M., et al. (1999). GC factor 2 represses platelet-derived growth factor A-chain gene transcription and is itself induced by arterial injury. Circ. Res. 84: 1258–1267.
25. Borowsky, M. L., and Hynes, R. O. (1998). Layilin, a novel talin-binding transmembrane protein homologous with C-type lectins, is localized in membrane ruffles. J. Cell. Biol. 143: 429–442.
26. Bono, P., Rubin, K., Higgins, J. M., and Hynes, R. O. (2001). Layilin, a novel integral membrane protein, is a hyaluronan receptor. Mol. Biol. Cell. 12: 891–900.
27. Pletcher, M. T., Wiltshire, T., Cabin, D. E., Villanueva, M., and Reeves, R. H. (2001). Use of comparative physical and sequence mapping to annotate mouse chromosome 16 and human chromosome 21. Genomics 74: 45–54.
28. de Souza, S. J., et al. (2000). Identification of human chromosome 22 transcribed sequences with ORF expressed sequence tags. Proc. Natl. Acad. Sci. USA 97: 12690–12693.
29. Michaud, J., et al. (2000). Isolation and characterization of a human chromosome 21q22.3 gene (WDR4) and its mouse homologue that code for a WD-repeat protein. Genomics 68: 71–79.
30. Reymond, A., et al. (2001). The tripartite motif family identifies cell compartments. EMBO J. 20: 2140–2151.
Sequence data from this article have been deposited with the DDBJ/EMBL/GenBank Data Libraries under accession numbers AF358257-AF358258; AF360358; AF363446-447; AF375989; AF380178-184; AF391112-115; AJ409094; AY033899-908; AY035381-383; AY037804; AY039243-244; AY040086-090; AY040873-876.
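The accession list above uses an abbreviated range notation in which, for example, AF380178-184 stands for AF380178 through AF380184. As a convenience for readers who wish to retrieve each record individually, the following Python sketch expands such ranges. The function name and the parsing rules are assumptions made for illustration and are not part of any database tool.

```python
import re

# Minimal sketch: expand abbreviated accession ranges into full accessions.
# `expand_accessions` is a hypothetical helper, not a database API.

def expand_accessions(field):
    """Expand a ';'-separated accession list with abbreviated ranges.

    'AF380178-184' becomes AF380178 ... AF380184.
    'AF358257-AF358258' (full second accession) is also accepted.
    """
    result = []
    for item in (part.strip() for part in field.split(";")):
        if not item:
            continue
        if "-" not in item:
            result.append(item)
            continue
        first, last = item.split("-", 1)
        match = re.fullmatch(r"([A-Z]+)(\d+)", first)
        if match is None:
            raise ValueError(f"unrecognized accession: {first!r}")
        prefix, digits = match.groups()
        # The range end is either a full accession or only trailing digits.
        last_digits = last[len(prefix):] if last.startswith(prefix) else last
        if not last_digits.isdigit() or len(last_digits) > len(digits):
            raise ValueError(f"unrecognized range end: {last!r}")
        # Trailing digits replace the tail of the first accession number.
        end_digits = digits[:len(digits) - len(last_digits)] + last_digits
        start, end = int(digits), int(end_digits)
        if end < start:
            raise ValueError(f"descending range: {item!r}")
        width = len(digits)
        result.extend(f"{prefix}{n:0{width}d}" for n in range(start, end + 1))
    return result


expanded = expand_accessions("AF358257-AF358258; AF380178-184; AJ409094")
print(len(expanded), expanded)
```

Zero-padding to the original digit width keeps accessions with leading zeros in the numeric part intact.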
GENOMICS Vol. 78, Numbers 1-2, November 2001 Copyright © 2001 by Academic Press. All rights of reproduction in any form reserved.