From PREDs and Open Reading Frames to cDNA Isolation: Revisiting the Human Chromosome 21 Transcription Map


Article

doi:10.1006/geno.2001.6640, available online at http://www.idealibrary.com on IDEAL

Alexandre Reymond,1 Marc Friedli,1,* Charlotte Neergaard Henrichsen,1,* Fabian Chapot,1,* Samuel Deutsch,1,2 Catherine Ucla,1 Colette Rossier,1 Robert Lyle,1 Michel Guipponi,1 and Stylianos E. Antonarakis1,†

1 Division of Medical Genetics, University of Geneva Medical School, 1211 Geneva, Switzerland
2 Graduate Program of Molecular and Cellular Biology, University of Geneva Medical School, 1211 Geneva, Switzerland

*These authors contributed equally to this work. †To whom correspondence and reprint requests should be addressed. Fax: 0041227025706. E-mail: [email protected].

A supernumerary copy of human chromosome 21 (HC21) causes Down syndrome. To understand the molecular pathogenesis of Down syndrome, it is necessary to identify all HC21 genes. The first annotation of the sequence of 21q confirmed 127 genes, and predicted an additional 98 previously unknown “anonymous” genes (predictions (PREDs) and open reading frames (C21orfs)), which were foreseen by exon prediction programs and/or spliced expressed sequence tags. These putative gene models still need to be confirmed as bona fide transcripts. Here we report the characterization and expression pattern of the putative transcripts C21orf7, C21orf11, C21orf15, C21orf18, C21orf19, C21orf22, C21orf42, C21orf50, C21orf51, C21orf57, and C21orf58, the GC-rich sequence DNA-binding factor candidate GCFC (also known as C21orf66), PRED12, PRED31, PRED34, PRED44, PRED54, and PRED56. Our analysis showed that most of the C21orfs originally defined by matching spliced expressed sequence tags were correctly predicted, whereas many of the PREDs, defined solely by computer prediction, do not correspond to genuine genes. Four of the six PREDs were incorrectly predicted: PRED44 and C21orf11 are portions of the same transcript, PRED31 is a pseudogene, and PRED54 and PRED56 were wrongly predicted. In contrast, PRED12 (now called C21orf68) and PRED34 (C21orf63) are now confirmed transcripts. We identified three new genes, C21orf67, C21orf69, and C21orf70, not previously predicted by any programs. This revision of the HC21 transcriptome has consequences for the entire genome regarding the quality of previous annotations and the total number of transcripts. It also provides new candidates for genes involved in Down syndrome and other genetic disorders that map to HC21.

Key Words: human chromosome 21, transcription map, genomic sequences annotation, gene prediction, C21orf, Down syndrome, trisomy 21

INTRODUCTION

Trisomy 21 causes Down syndrome (DS), which is the main genetic cause of mental retardation. It affects approximately 1 live-born child in 700 [1]. To understand the molecular pathogenesis of DS, it is necessary to identify all of the human chromosome 21 (HC21) genes. The sequences of the human and other eukaryotes’ genomes and of expressed sequence tags (ESTs) facilitate this cataloging [2–8]. The initial annotation of the complete sequence of the long arm of HC21 showed 127 genes and 98 predicted transcripts. HC21 genes and predicted transcripts were identified on the basis of identity/similarity to known proteins, identity to spliced ESTs, and/or consistent exon/intron prediction [5,9,10]. To evaluate the quality of the HC21 gene prediction, we characterized a set of open reading frames (C21orfs) defined by matching spliced ESTs and PREDs (predictions) defined solely by computer prediction ([5], HC21 gene catalog, category 4). We report here the determination of the human and mouse sequence of these gene models, detailed information about their pattern of expression and of the alternatively spliced transcripts. Our data show that thorough and complete annotation of HC21 will require a single gene-based approach and cannot be based solely on exon prediction.

GENOMICS Vol. 78, Numbers 1-2, November 2001 Copyright © 2001 by Academic Press. All rights of reproduction in any form reserved. 0888-7543/01 $35.00


TABLE 1: C21orfs

Gene(a)   Aliases(a)    Hs accession(b)             Hs isoforms(d)  Mm accession(c)  Mm isoforms(d)  Hs size [a.a.](e)  Mm size [a.a.](e)  Identity [%](f)  Homologies(g)     C. elegans(h)
C21orf7   -             AY033900                    -               AY033899         -               142                142                90               TAB2-bdg of TAK1  -
C21orf11  PRED44        AJ409094/AF375989/AY040086  +               AF360358         -               235 (274)          235                78               2-19              -
C21orf15  -             AY040090                    -               -                -               > 110              -                  -                -                 -
C21orf18  -             NM_017438, AF391112         +               AY037804         -               440                439                77               -                 Y92H12BR.6
C21orf19  -             AF363446                    -               AF363447         -               297                297                99               -                 C37C3.8
C21orf22  -             AY040089                    -               -                -               -                  -                  -                -                 -
C21orf42  -             AY035382-383                +               -                -               81                 -                  -                -                 -
C21orf50  SON           AF380179-184                +               -                -               2426               -                  -                D111 and dsRBD    -
C21orf51  -             AY033901                    -               AY033902         -               58                 55                 83               -                 -
C21orf57  -             AY040873-876                +               -                -               > 299              -                  -                -                 -
C21orf58  -             AY039243-244                +               -                -               239 (194)          -                  -                -                 -
C21orf63  PRED34/SUE21  AF358258/AY040087           +               AF358257         -               441                440                87               SUEL              F32A7.3A
C21orf66  -             AY033903-906                +               AY033907-908     +               917                > 855              95               GCF2              F43G9.12
C21orf67  PRED54        AF380178/AY040088           +               -                -               204                -                  -                -                 -
C21orf68  PRED12        AK022689                    -               AK014255         -               273                273                92               Layilin           -
C21orf69  PRED54        AY035381                    -               -                -               163                -                  -                -                 -
C21orf70  PRED56        AF391113-114                +               AF391115         -               215 (230)          219                76               -                 -
HSPC230   PRED31        AF151064                    -               AK006199         +               240                240                80               -                 -

a Nomenclature committee approved gene symbol and aliases.
b Homo sapiens GenBank accession number(s).
c Mus musculus GenBank accession number(s).
d The (+) sign specifies the identification of multiple H. sapiens and/or M. musculus isoforms.
e Size of the encoded peptides in number of amino acid residues.
f Percentage of identity at the amino acid level between H. sapiens and M. musculus encoded peptides.
g Identified homologies, see text for details.
h Name of homologous gene in C. elegans.
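Footnote f of Table 1 reports percentage identity at the amino acid level. The following is a minimal sketch of how such a value can be derived from a pairwise alignment, assuming identity is counted over aligned columns. The function name and the toy aligned peptides are hypothetical and are not data or software from this study; only the counts 74 of 138 are taken from the text (C21orf7 versus TAK1).

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over aligned columns; '-' denotes a gap.

    Columns in which both sequences carry a gap are ignored.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if not (a == "-" and b == "-")]
    if not columns:
        raise ValueError("no aligned columns to compare")
    matches = sum(1 for a, b in columns if a == b and a != "-")
    return 100.0 * matches / len(columns)


# Hypothetical toy alignment (not sequences from this study).
toy_human = "MKT-LLVAG"
toy_mouse = "MKTALLIAG"
print(round(percent_identity(toy_human, toy_mouse)))  # 7 identical columns out of 9

# Identity expressed from counts, as quoted in the text: 74 of 138 residues.
print(round(100 * 74 / 138))
```

Whether gap columns count toward the denominator is a convention that differs between alignment tools; the sketch includes them, which is one common choice.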

RESULTS

To identify candidate genes for DS phenotypes and to contribute to the update of the HC21 annotation, we are systematically sequencing HC21 ESTs and their mouse orthologues that correspond to uncharacterized genes. Here we report the isolation and characterization of a set of 18 predicted gene models first reported in the annotation of the complete sequence of HC21 [5]. To avoid any bias in our results, we randomly selected these gene candidates. We sequenced human and/or mouse ESTs matching C21orf7, C21orf11, C21orf15, C21orf18, C21orf19, C21orf22, C21orf42, C21orf50, C21orf51, C21orf57, and C21orf58, PRED12, PRED31, PRED34, PRED44, PRED54, and PRED56, and the GC-rich sequence DNA-binding factor candidate GCFC (also known as C21orf66) gene models. When the sequence of the mouse cDNA was first determined, the sequence of its human orthologue was inferred from the HC21 genomic sequence and its existence was confirmed by RT-PCR (C21orf7, C21orf19, PRED34, and PRED56, described below). Most results (mapping position, number of exons, human gene size, accession numbers of human and mouse cDNA and of human genomic sequence, presence of alternatively spliced isoforms, human and mouse encoded peptide size, sequence comparison, homologies, and result of model refinement) are summarized in Tables 1 and 2 and are described briefly below. The human genomic structures and intron–exon junctions are available (http://www.medgen.unige.ch/). This work allowed identification of new complementary single-nucleotide polymorphisms (cSNPs), which are added directly to the HC21 cSNP database (http://csnp.unige.ch/) [11].

C21orf7

Mouse mC21orf7 contains an ORF of 429 nucleotides and encodes a predicted protein of 142 residues (GenBank acc. no. AY033899) and is 90% identical to its human


TABLE 2: C21orfs mapping, gene structure, and model refinement

Gene(a)   Hs maps to  between(b)  and(b)    genomic seq(c)     from [bp](d)  to [bp](d)  exons [n](e)  Hs gene size [kb](e)  Gene model refinement(f)
C21orf7   21q22.11    CCT8        BACH1     AL163249           149171        190718      4             42                    none
C21orf11  21q22.3     BACE2       MX2       AL163285           125073        165679      9             40                    combines two models in one
C21orf15  21q11       centromere  STCH      AL163204           206914        201692      ≥6            ≥6                    none
C21orf18  21q22.13    RUNX1       CBR1      AP001724           296434        270591      11            25                    none
C21orf19  2           -           -         NT_005194          -             -           8             ?                     unspliced pseudogene
C21orf22  21q22.3     TMPRSS2     ANKRD3    AP001743           130076        131220      ≥3            >2                    none
C21orf42  21q21.3     NCAM2       VE-JAM    AP001693           177747        131873      4             46                    none
C21orf50  21q22.11    GART        CRYZL1    AP001717           152018        186435      15            35                    combines two models in one
C21orf51  21q22.12    KCNE2       KCNE1     AP001719           310376        319646      3             10                    none
C21orf57  21q22.3     MCM3        PCNT      AP001759           304112        315450      ≥5            ≥12                   none
C21orf58  21q22.3     MCM3        PCNT      AP001759           335934        318321      8             18                    none
C21orf63  21q22.11    KIAA0539    TCP10L    AP001714           38796         141134      8             103                   none
C21orf66  21q22.11    SYNJ1       PRKCBP2   AP001715           58613         21895       20            37                    larger gene
C21orf67  21q22.3     ITGB2       ADARB1    AL163300           316517        311400      4             6                     wrong prediction/gene model position split in two
C21orf68  21q21.1     BTG3        PRSS7     AP001672           196597        219119      6             23                    none
C21orf69  21q22.3     ITGB2       ADARB1    AL163300           311645        309486      2             3                     wrong prediction/gene model position split in two
C21orf70  21q22.3     ITGB2       ADARB1    NT_011515          1672688       1709627     6             37                    wrong prediction
HSPC230   6           -           -         AL355586/AC016916  -             -           4             23                    pseudogene on HC21

a Committee approved gene symbol and aliases.
b Gene names as specified in [5].
c H. sapiens genomic sequence GenBank accession numbers.
d Gene start and end positions.
e Number of exons and size in kilobases of the H. sapiens gene.
f Gene model refinement, see text for details.

counterpart at the amino acid level (GenBank acc. no. AY033900). C21orf7 is highly homologous to the TAB2 binding domain of Xenopus laevis TAK1 (54% identity (74 of 138) and 67% similarity (93 of 138)). TAK1 is a member of the MAPKKK family and is activated by many cytokines, whereas TAB2 has been shown to be an adapter linking TAK1 and TRAF6 and mediating TAK1 activation in response to interleukin 1 [12–14]. The homology of C21orf7 to TAK1 indicates that the HC21 gene might encode a peptide involved in the interleukin 1 signaling pathway.

C21orf11/PRED44

We identified three human alternatively spliced isoforms: C21orf11 form A, encoding a 274-residue predicted peptide (GenBank acc. no. AJ409094); C21orf11 form B, missing exon 2 (GenBank acc. no. AF375989); and C21orf11 form C, missing exons 2 and 3 (GenBank acc. no. AY040086). The gene spans both the C21orf11- and PRED44-predicted gene models [5]; however, only two of the five PRED44-predicted exons are present in the C21orf11 cDNA. The human C21orf11 form B and mouse mC21orf11 contain ORFs of 708 nucleotides and encode predicted proteins of 235 residues (GenBank acc. nos. AF375989 and AF360358, respectively). C21orf11 is 78% identical to its mouse counterpart at the amino acid level. C21orf11 is a member of the functionally uncharacterized 2-19 protein family, a group of proteins present only in mammals. To determine where C21orf11 might exert its function, fusion proteins were expressed in Cos7 and U2OS cells. C21orf11 form A localizes in a discrete vesicular and perinuclear structure (Figs. 1A–1C, and data not shown). Control experiments with green fluorescent protein (GFP) empty vectors showed a


FIG. 1. Intracellular localization of 5′ EGFP-C21orf11 form A fusion protein. Fluorescence microscopy analysis is shown of fixed Cos7 cells transiently transfected with 5′ EGFP-C21orf11 expression vector. EGFP-C21orf11 fluorescence (A–D, green signal) was counterstained with 4′,6-diamidino-2-phenylindole (DAPI; C, blue signal) and anti-β-tubulin antibody (D, red signal).

mainly nuclear staining pattern, as previously published (data not shown). To better define the compartments identified by the C21orf11 protein, we studied colocalization using compartment-specific markers. The results show that the C21orf11 protein does not colocalize with the tested structures (microtubules, endoplasmic reticulum, and Golgi; Fig. 1D and data not shown).

C21orf15

The partial human C21orf15 cDNA encodes a potential ORF of more than 110 residues at its carboxy end (GenBank acc. no. AY040090). Multiple attempts to clone the 3′ untranslated region (UTR) of the gene by rapid amplification of cDNA ends (RACE) have failed.

C21orf18

Mouse mC21orf18 contains an ORF of 1320 nucleotides and encodes a putative protein of 439 residues (GenBank acc. no. AY037804). During the preparation of this manuscript, its human counterpart was cloned and sequenced by the New Energy and Industrial Technology Development Organization (NEDO) human cDNA sequencing project (C21orf18 spliced form A, GenBank acc. no. NM_017438). The two encoded peptides are 77% identical at the amino acid level. We identified


TABLE 3: C21orfs expression(a)

[The (+) entries of this expression matrix could not be assigned to individual cells in the extracted text. Tissues and developmental stages listed: Brain, Heart, Kidney, Spleen, Lung, Liver, Sm. intestine, PBLs, Muscle, Testis, Placenta, Skin, Eyes, Bone marrow, Thymus, Stomach, Colon, Ovary, Fetal brain, Fetal heart, Fetal lung, Fetal kidney, Fetal liver, 8.5 day emb., 9.5 day emb., 12.5 day emb., 19 day emb. Transcripts listed: Hs C21orf7, Mm C21orf7, Hs C21orf11, Mm C21orf11, Hs C21orf15, Hs C21orf22, Hs C21orf42, Hs C21orf51, Mm C21orf51, Hs C21orf57, Hs C21orf58, Hs C21orf63, Mm C21orf63, Hs C21orf66, Mm C21orf66, Hs C21orf67, Hs C21orf68, Mm C21orf68, Hs C21orf69, Hs C21orf70, Hs PRED31, Mm PRED31.]

a Grayed areas specify untested tissues/developmental stages.

an alternatively spliced isoform in humans missing exon 4 (C21orf18 form B, GenBank acc. no. AF391112).

C21orf19

Mouse mC21orf19 and human C21orf19 contain ORFs of 894 nucleotides and encode predicted proteins of 297 residues (GenBank acc. no. AF363446 and AF363447). The human gene maps to chromosome 2 (HC2) (clone RP11-437L5, GenBank acc. no. AC022481, and clone RP11-431P19, GenBank acc. no. AC010981) and contains eight exons, whereas an unspliced copy is present on HC21. The HC21 copy shows 12 mismatches and deletion of two closely positioned stretches of 4 and 2 nucleotides, respectively, over 1762 nucleotides compared with the HC2 C21orf19 cDNA. These changes do not create any in-frame termination codon, but the following observations support the hypothesis that the HC21 copy is a pseudogene. All 49 human ESTs deposited in dbEST (EST database) correspond to the HC2 C21orf19 copy.


Moreover, the mouse orthologue is identical to the HC2 copy in four amino acid residues that are different between the HC2 and the HC21 copies. Finally, the loss of two codons plus the resulting change of seven codons in the HC21 copy are not present in the mouse gene. We cannot, however, totally rule out the possibility that the HC21 copy corresponds to an extremely rare transcript, encoded by an intronless gene. The C21orf19-encoded peptide is a member of a highly conserved family of proteins containing the aminoacyl-transfer RNA synthetases class-II signature 2 (PROSITE). This group of proteins includes Drosophila melanogaster YC4P, Caenorhabditis elegans C37C3.8, Arabidopsis thaliana 2G2520, Dictyostelium discoideum 2034, Schizosaccharomyces pombe C4H3.04C, Saccharomyces cerevisiae YJR83.9, and Pyrococcus abyssi PAB0370.

C21orf22

The partial human C21orf22 cDNA encodes a potential ORF at its 5′ end (GenBank acc. no. AY040089). Multiple attempts to clone the 5′ UTR of the gene by RACE have failed.

C21orf42

We have identified two C21orf42 alternatively spliced forms comprising exons 1, 2, 4, and 5 (C21orf42 form A, GenBank acc. no. AY035382) and exons 1–5 (C21orf42 form B, GenBank acc. no. AY035383), respectively. Both contain an ORF of 246 nucleotides and encode a putative protein of 81 residues. The translocation t(7;21)(q21.2;q22), found in a patient with splenic marginal zone lymphoma, juxtaposes genomic DNA 66 kb upstream of the transcription start site of CDK6 (cyclin-dependent kinase 6) and the C21orf42 gene [15], probably causing dysregulation of the expression of the kinase.

C21orf50/SON

Full-length sequencing of mouse ESTs corresponding to C21orf50 showed that this predicted ORF and the published mouse orthologue of human SON/KIAA1019 cDNA overlap. We therefore reanalyzed all human ESTs and cDNAs corresponding to both the C21orf50 gene model and SON and confirmed our prediction by RT-PCR. We determined that the longest SON ORF spans 7281 bp and encodes a predicted protein of 2426 residues, rather than the previously reported smaller predicted protein of 1929 amino acid residues [16–18]. We characterized six alternatively spliced isoforms, SON forms A–F, and revised the genomic structure of the gene (GenBank acc. no. AF380177–184). All intron–exon junctions conform to the donor–acceptor consensus sequence (http://www.medgen.unige.ch/). Exon 3 is remarkably large, with a size of 5916 bp. The SON protein contains both a D111 and a double-stranded RNA-binding domain (dsRBD) (residues 2311–2348 and 2371–2408, respectively, numbered according to SON form F, GenBank acc. no. AF380184). The dsRBD proteins (also known as double-stranded RNA-binding motif (DSRM) proteins) specifically recognize double-stranded RNA [19,20]. They are involved mainly in post-transcriptional gene


regulation by preventing protein expression or by mediating RNA localization. The D111 or G-patch domain is a short conserved region of about 40 amino acids occurring in putative RNA-binding proteins, including tumor suppressor and DNA damage-repair proteins [21,22].

C21orf51

We determined human and mouse C21orf51 by direct sequencing of full-length ESTs. They encode short predicted peptides of 58 and 55 residues (GenBank acc. nos. AY033901 and AY033902), which are 83% identical.

C21orf57

The partial human C21orf57 cDNA encodes a potential ORF of > 299 residues at its 5′ end (C21orf57 form A, consisting of exons x to x + 4, GenBank acc. no. AY040873). We identified three alternatively spliced forms, either missing exon x + 2 (C21orf57 form B, GenBank acc. no. AY040874), or including two additional exons (y and y + 1, C21orf57 form C, GenBank acc. no. AY040875), or including two additional exons (y and y + 1) and missing exon x + 2 (C21orf57 form D, GenBank acc. no. AY040876).

C21orf58

We have identified two human C21orf58 alternatively spliced forms that differ in exon 7 (C21orf58 form A and form B, GenBank acc. nos. AY039243 and AY039244). They encode predicted peptides of 239 and 194 residues, respectively.

C21orf63/PRED34

We determined the mouse orthologue of PRED34 by direct sequencing; it contains an ORF of 1323 nucleotides and encodes a protein of 440 residues (mC21orf63, GenBank acc. no. AF358257). We inferred the sequence of its human orthologue from the HC21 genomic sequence and confirmed its existence by RT-PCR (C21orf63, GenBank acc. no. AF358258). We identified an alternatively spliced form that lacks exon 2 and encodes a truncated protein of only 70 residues (GenBank acc. no. AY040087). The nomenclature committee-approved gene symbol for this transcript is C21orf63 (human chromosome 21 open reading frame 63). Two of the three PRED34-predicted exons are present in the C21orf63 cDNA. C21orf63 encodes a protein with two D-galactoside/L-rhamnose binding SUEL domains (sea urchin egg lectin, alias galactose-binding lectin domain, Pfam PF02140). A homologous gene called F32A7.3A was identified in C. elegans, but RNA interference experiments did not demonstrate any phenotypes specifically related to the gene [23].

C21orf66/GCFC

The National Center for Biotechnology Information Annotation Project identified a GC-rich sequence DNA-binding factor candidate (GCFC) that maps between SYNJ1 and PRKCBP2 on 21q22.11. Our analysis of this genomic region with the NIX suite of programs showed that this


putative gene was longer than predicted. The sequences of clones isolated from various human cDNA pools showed that C21orf66 spans 20 exons. All intron–exon junctions conform to the donor–acceptor splice-site rules (Table 2). The longest mRNA contains an ORF of 2745 nucleotides and encodes a predicted protein of 917 residues. The nomenclature committee-approved gene symbol for this transcript is C21orf66. We characterized four alternatively spliced isoforms: C21orf66 form A, encompassing exons 1–7, 9A, 10–16, and 18–20 (GenBank acc. no. AY033903); C21orf66 form B, including exons 1–7, 9A, and 10–17 (GenBank acc. no. AY033904); C21orf66 form C, containing exons 1–8 (GenBank acc. no. AY033905); and C21orf66 form D, encompassing exons 1–7, 9B, 10–16, and 18–20 (GenBank acc. no. AY033906). The D form is the most abundant in all the tissues tested by RT-PCR; this isoform is potentially bicistronic, encoding putative peptides of 511 and 411 residues, respectively. We isolated a partial mouse orthologue, mC21orf66, and characterized two alternatively spliced forms (mC21orf66 form A and mC21orf66 form D, GenBank acc. no. AY033907-908); therefore, both the longest and the potentially bicistronic isoforms were conserved between rodents and primates. The gene is also conserved in C. elegans (F43G9.12). The C21orf66 protein is homologous to the GCF2 (GC-binding factor-2) transcription factor between positions 1–50, 179–283, and 549–779 (positions relative to isoform A). GCF2 represses transcription of the platelet-derived growth factor α chain gene [24]. Moreover, C21orf66 contains a nucleosome assembly protein domain from amino acid 83 to amino acid 557. These partial homologies to a transcription repressor and histone-interacting protein indicate that C21orf66 is involved in the regulation of transcription.

C21orf67/PRED54

Human C21orf67 contains an ORF of 615 nucleotides and encodes a putative protein of 204 residues (GenBank acc. no. AF380178). We identified an alternatively spliced form, which contains an alternative exon 3 and encodes a truncated protein of 82 residues (GenBank acc. no. AY040088). C21orf67 maps to the PRED54 locus sequence [5], but none of the predicted exons of PRED54 are present in the C21orf67 cDNA sequence.

C21orf68/PRED12

BLAST analysis showed that two human and mouse sequences recently deposited by the NEDO human cDNA sequencing project and the Functional Annotation of Mouse (FANTOM) Consortium, respectively (GenBank acc. nos. AK022689 and AK014255), were homologous to the PRED12 gene model. The two encoded peptides are 92% identical at the amino acid level. The nomenclature committee-approved gene symbol for this gene is now C21orf68. All four predicted exons of PRED12 [5] are present in the C21orf68 cDNA. C21orf68 shows homology to hamster layilin, which is a membrane-binding site for talin in peripheral ruffles of spreading cells and a hyaluronan receptor [25,26]. The


similarity to C21orf68 extends over both the C-type lectin domain (64% identity (85 of 132 residues) and 79% similarity (104 of 132)) and the transmembrane region (33% identity (14 of 42) and 57% similarity (24 of 42)). These homologies indicate a potential involvement of C21orf68 in cell adhesion and motility.

C21orf69/PRED54

Human C21orf69 contains an ORF of 492 nucleotides and encodes a predicted protein of 163 residues (GenBank acc. no. AY035381). C21orf69 maps to the wrongly predicted PRED54 locus [5], but its direction of transcription is opposite.

C21orf70/PRED56

Mouse mC21orf70 and human C21orf70 form B are 76% identical at the amino acid level (GenBank acc. nos. AF391114 and AF391115). We identified an alternatively spliced isoform with a longer exon 2 (C21orf70 form A, GenBank acc. no. AF391113). Only a portion of one of the three PRED56 gene model predicted exons [5] is present in the C21orf70 cDNA sequence.

PRED31 Is a Pseudogene

Sequencing of the human ESTs corresponding to the PRED31 gene model [5] showed discrepancies with the HC21 genomic sequence. However, these ESTs match the recently deposited sequences of human and mouse full-length cDNA of HSPC230 (human full-length cDNA cloned from CD34+ stem cells program, Shanghai Institute of Hematology, GenBank acc. no. AF151064, and FANTOM Consortium, GenBank acc. nos. AK006199, AK007370, and AK008795), which maps to human chromosome 6 (genomic clones RP11-59I9 and RP11-687L3, GenBank acc. nos. AL355586 and AC016916) and spans four exons (Table 2). Both of the 240 residue-encoded peptides are 80% identical at the amino acid level. A spliced pseudogene maps to 21q22.11 between TIAM1 and SOD1 (genomic sequence, GenBank acc. no. AP001711) and an unspliced pseudogene to HC6 (clones RP11-486M3 and RP11-795O6, GenBank acc. nos. AL357075 and AC068982), respectively.

Expression Pattern of Chromosome 21 PREDs and ORFs

We used qualitative RT-PCR of panels of normalized human and mouse cDNA to study the expression profiles of the different C21orf genes. Although C21orf7, C21orf15, C21orf51, C21orf63, C21orf66, and PRED31 are ubiquitously expressed, C21orf11, C21orf22, C21orf42, C21orf57, C21orf58, and C21orf68 show restricted patterns of expression (Table 3). The inability to detect expression of C21orf67, C21orf69, and C21orf70 in our assays indicates that these transcripts are of very low abundance. Indeed, only four tags for each of the C21orf67 and C21orf69 genes can be identified in dbEST. All 12 C21orf70 human ESTs deposited in dbEST were isolated from tumor tissues, indicative of high expression of this gene in neoplasia and low expression in normal tissues. To assess the quality of our RT-PCR


results, we analyzed the expression pattern of a ubiquitously expressed gene (C21orf66) and a specifically expressed gene (C21orf11) by probing multiple-adult-tissue northern blots. As expected, C21orf66 is expressed ubiquitously; we detected four mRNA species of approximately 12, 9, 5, and 4 kb in all tissues tested (brain, heart, kidney, liver, colon, and spleen; data not shown). Similarly, we also confirmed the restricted pattern of expression of C21orf11; we detected a single mRNA species of 1 kb in kidney and colon (data not shown).
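The ORF lengths and peptide sizes quoted in the Results are linked by simple arithmetic. As a minimal sketch, assuming each reported ORF length includes the terminal stop codon (so residues = nucleotides / 3 - 1), the following checks a subset of the values given in the text. The function name is hypothetical; the number pairs are copied from the Results.

```python
def residues_from_orf(orf_nt: int) -> int:
    """Residues encoded by an ORF whose length (in nt) includes the stop codon."""
    if orf_nt < 6 or orf_nt % 3 != 0:
        raise ValueError("ORF length must be a multiple of 3 with at least one sense codon")
    return orf_nt // 3 - 1  # the last codon is the stop codon and encodes no residue


# (ORF length in nucleotides, reported peptide size in residues), from the Results text.
reported = {
    "mC21orf7": (429, 142),
    "C21orf11 form B": (708, 235),
    "mC21orf18": (1320, 439),
    "C21orf19": (894, 297),
    "C21orf42": (246, 81),
    "SON": (7281, 2426),
    "mC21orf63": (1323, 440),
    "C21orf67": (615, 204),
    "C21orf69": (492, 163),
}

for name, (orf_nt, residues) in reported.items():
    computed = residues_from_orf(orf_nt)
    status = "consistent" if computed == residues else "check"
    print(f"{name}: {orf_nt} nt -> {computed} residues ({status})")
```

Under the stated assumption, each listed pair agrees with the size reported in the text.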

DISCUSSION

The initial annotation of the complete sequence of 21q showed 127 genes and 98 predicted transcripts identified by gene prediction programs and/or spliced ESTs [5,9,10]. To refine and update HC21 sequence annotation and evaluate the quality of the prediction of gene models, we characterized a set of C21orfs and PREDs. The C21orfs were previously unknown anonymous genes defined by matching spliced ESTs ([5], subcategories 4.1 and 4.2 of gene models), whereas PREDs were defined solely by computer prediction with no matching spliced ESTs ([5], subcategory 4.3). This study allowed us to update and correct previous gene models by combining two predicted genes in one (for example, C21orf11 and PRED44), by showing that the predicted gene does not correspond to a transcript as another gene is present at this locus (for example, PRED54 and PRED56), or by showing that the predicted gene corresponds to a pseudogene (for example, PRED31), whereas the analogous gene maps elsewhere on the human genome (Table 2). Our study confirms the intuitive prediction that gene models from subcategories 4.1 and 4.2 have stronger coding potential than those of subcategory 4.3. It also emphasizes the limits of in silico-only gene prediction, as four of the six characterized PREDs (PRED31, PRED44, PRED54, and PRED56) proved to be wrongly anticipated. Similarly, another recently published study pertinent to HC21 annotation found that the only PRED gene model (PRED43) mapping to the interval studied (21q22.13–21q22.2) by human and mouse comparative mapping [27] was not predicted correctly. Likewise, during annotation of HC22, it was noted from the beginning that almost 40% of the GENSCAN-predicted genes did not form part of any gene confirmed by other means and included an unknown proportion of false positives. This observation led the authors to estimate that only 100 of the 325 HC22-predicted gene models would prove to be genuine genes [6].

Conversely, we expect that as-yet-undescribed HC21 genes will be characterized. Indeed, sequencing of HC22 showed that 6% of annotated genes were not detected and 16% of all known exons were entirely missed by prediction programs [6]. Consistently, analysis of the HC22 gene coverage with ORESTES (ORF ESTs) identified 219 contigs that matched originally unannotated regions from HC22 [28]. Of these, 171 matched an available dbEST sequence not used in the initial HC22 annotation. Likewise, we identified three new HC21 genes, C21orf67, C21orf69, and C21orf70, not predicted by computer programs at the PRED54 and PRED56 loci. These genes must be added to the three recently described PREDs (PRED67, PRED68, and PRED69) found by comparative mapping [27]. In addition to the effort described here to refine HC21 sequence annotation, we are re-analyzing the entire HC21 sequence incorporating new ESTs (including the ORESTES database [28]). The resulting revised HC21 transcriptome will have consequences for the rest of the genome regarding the quality of previous annotations and the total number of transcripts. It will also provide new candidates for genes involved in DS and other genetic disorders that map to HC21.
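The proportions quoted in the Discussion can be tallied directly. The sketch below is only an illustration of that bookkeeping: the outcome labels are our own shorthand for the refinements described in the text, not categories defined by the annotation in [5].

```python
from collections import Counter

# Outcome of each characterized PRED, as described in the text (labels are shorthand).
pred_outcomes = {
    "PRED12": "confirmed",         # now C21orf68
    "PRED34": "confirmed",         # now C21orf63
    "PRED31": "pseudogene",        # functional gene (HSPC230) maps to chromosome 6
    "PRED44": "merged",            # part of the C21orf11 transcript
    "PRED54": "wrong prediction",  # C21orf67 and C21orf69 found at the locus
    "PRED56": "wrong prediction",  # C21orf70 found at the locus
}

counts = Counter(pred_outcomes.values())
n_wrong = sum(1 for outcome in pred_outcomes.values() if outcome != "confirmed")
fraction_wrong = n_wrong / len(pred_outcomes)

print(dict(counts))
print(f"{n_wrong} of {len(pred_outcomes)} PREDs not confirmed as predicted ({fraction_wrong:.0%})")

# HC22 estimate cited from [6]: 100 genuine genes among 325 predicted gene models.
print(f"HC22: {100 / 325:.0%} of predicted gene models expected to be genuine")
```

With only six PREDs characterized, the resulting fraction is a small-sample observation rather than an error rate for the whole category.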

MATERIALS AND METHODS

cDNA cloning. ESTs were obtained from Research Genetics (http://www.resgen.com) except where otherwise indicated, and were sequenced directly. The EST identities were as follows. C21orf7: mouse IMAGE clone 1153184. C21orf11 (PRED44): human and mouse IMAGE clones 741869, 1546160, and 1672666; the human GKCDWC07 clone was provided by Z. Han and the Chinese National Human Genome Center at Shanghai (CHGC), and the 5′ and the 3′ portions of the transcript were isolated through RACE experiments from kidney cDNA. C21orf15: human IMAGE clones 153429, 1473914, 1654356, and 2630302; the 5′ and the 3′ extremities of the transcript were isolated through RACE experiments from spleen cDNA. C21orf18: mouse IMAGE clone 717391. C21orf19: mouse IMAGE clones 976466, 1512004, 3468606, and 3708762; the mouse H3015D11 clone was part of the National Institute of Aging (NIA) 15K Mouse cDNA Clone Set (http://lgsun.grc.nia.nih.gov/cDNA/15k.html), freely distributed to the scientific community. C21orf22: human IMAGE clones 730871 and 1638304. C21orf42: human IMAGE clone 757422. C21orf50: mouse IMAGE clones 532600 and 1053712. For C21orf50 analysis, the mouse H3094B06 clone was part of the NIA 15K Mouse cDNA Clone Set. As the mouse EST sequences were overlapping with mouse Son, we determined a potential human C21orf50/SON cDNA contig in silico. In this updated transcript, C21orf50 spans exons 1, 2, and the first 98 bp of exon 3, and SON spans the last 4508 bp of exon 3 and the remaining exons (http://www.medgen.unige.ch/). To confirm that C21orf50 and SON were indeed portions of the same gene, we used RT-PCR from three different brain cDNA pools with oligonucleotides 9059 (5′-CTGCCAGGAGTCTACCAAATGA-3′) and 9060 (5′-CTCTACTGCTGCTGTCACCAAA-3′), mapping to exon 2 and the 3′ region of exon 3, respectively. The size and sequence of this amplicon confirmed the new SON organization. C21orf51: mouse IMAGE clone 3152735. C21orf57: human IMAGE clones 1032047 and 2296208. C21orf58: human IMAGE clones 1416249 and 2586582. C21orf63 (SUE21/PRED34): mouse IMAP clone UI-MAO0-aby-g-12-0-UI. For C21orf66 (GCFC), the 3′ portion of the transcript was isolated through RT-PCR and RACE experiments from various cDNAs (brain, placenta, spleen, heart, small intestine, lung, testis, and fetal lung), whereas the 5′ portion was retrieved through screening a fetal brain cDNA library (Clontech, Palo Alto, CA). The mouse orthologue was cloned by screening of a mouse testis cDNA library (Clontech) and subsequent 5′-RACE experiments. C21orf67 (PRED54): human IMAGE clone 3134234. C21orf68 (PRED12): clone NT2RM4001813, from the NEDO human cDNA sequencing projects (Japan). C21orf69 (PRED54): human IMAGE clone 1705552. C21orf70 (PRED56): mouse IMAGE clone 3515288.

Rapid amplification of cDNA ends. Both 5′- and 3′-RACE were done on various poly(A)+ RNAs using the Marathon cDNA Amplification Kit (Clontech). Double-stranded cDNA synthesis and adaptor ligations to the synthesized cDNA were done according to the manufacturer’s instructions. PCR products were purified and sequenced directly.

Genomic structure. After identification of the transcript, the intron–exon junctions were determined by comparison of the genomic sequence to the cDNA


sequence using the est_genome software, available through the UK Human Genome Mapping Project (http://www.hgmp.mrc.ac.uk).

Expression pattern studies. RT-PCR was done on panels of human and mouse cDNA, prepared as described [29]. The panels consisted of 20 human and 16 mouse pooled samples originating from different tissues and developmental stages. The sequences of the primers used in this analysis are available on request. Human multiple-tissue northern blots (Origene, Rockville, MD) were hybridized with partial C21orf11 ORF or C21orf66 ORF cDNA following the manufacturer's recommendations. Sample loading was assessed using an actin probe.

Subcellular localization. U2OS and Cos7 cells were transfected with 5′-EGFP-C21orf11 (EGFP, enhanced GFP) or 5′-HA-C21orf11 (HA, hemagglutinin) expression vectors using Lipofectamine (Life Technologies, Gaithersburg, MD) as described before [30]. At 24 h after transfection, EGFP fluorescence was detected on cells fixed with 4% paraformaldehyde. Images were analyzed with Adobe Photoshop software. The anti-Giantin (324450; Calbiochem, La Jolla, CA) and ER-Tracker (Molecular Probes, Eugene, OR) compartment-specific markers and anti-α, anti-β, and anti-α tubulin (Sigma, St. Louis, MO) were used in the co-localization experiments. pEGFP and pHA-CDNA3 vectors without inserts were expressed in cells as controls.
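As a reader-side sanity check on the primer pair 9059/9060 used for the C21orf50/SON RT-PCR, the following minimal Python sketch computes length, GC fraction, and a rough melting temperature for each oligonucleotide. It is not part of the published protocol. The Wallace rule (2 °C per A/T, 4 °C per G/C) is an assumption chosen for simplicity and is only a coarse estimate for oligonucleotides of this length, and the function names are hypothetical.

```python
# Illustrative primer check; the Tm formula (Wallace rule) is an assumption,
# not the method used by the authors.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

# Sequences as given in the cDNA cloning paragraph (5' to 3').
PRIMERS = {
    "9059": "CTGCCAGGAGTCTACCAAATGA",
    "9060": "CTCTACTGCTGCTGTCACCAAA",
}


def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in the sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def wallace_tm(seq: str) -> int:
    """Rough melting temperature in degrees C: 2*(A+T) + 4*(G+C)."""
    seq = seq.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc


def reverse_complement(seq: str) -> str:
    """Reverse complement, i.e. the other strand read 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def summarize(primers: dict[str, str]) -> dict[str, tuple[int, float, int]]:
    """Map primer name to (length, GC fraction, Wallace Tm)."""
    return {
        name: (len(seq), gc_fraction(seq), wallace_tm(seq))
        for name, seq in primers.items()
    }


if __name__ == "__main__":
    for name, (length, gc, tm) in summarize(PRIMERS).items():
        print(f"{name}: {length} nt, GC {gc:.0%}, Tm ~{tm} C (Wallace rule)")
```

Under these assumptions the two primers come out matched in length and GC content, which is what one would want from a pair used together; a nearest-neighbor Tm model would give more realistic absolute values.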

ACKNOWLEDGEMENTS

We thank Catia Attanasio, John-Louis Blouin, Olivier Menzel, Marguerite Neerman-Arbez, and Marie Wattenhofer, University of Geneva, for their suggestions and/or critical reading of the manuscript. We thank Joelle Michaud, Marie-Pierre Papasavvas, and Natalie Scamuffa, University of Geneva, for core assistance, and Anne Simon for preparation of the manuscript. This work was supported by grants from the Jérôme Lejeune foundation to R.L., A.R., and M.G., and from the Swiss FNRS 31.57149.99, the Swiss FNRS NPR38, the European Union/OFES, and the ChildCare foundation to S.E.A.

RECEIVED FOR PUBLICATION JULY 5; ACCEPTED SEPTEMBER 7, 2001.

REFERENCES

1. Epstein, C. J. (1995). Down syndrome (trisomy 21). In The Metabolic and Molecular Bases of Inherited Disease (C. R. Scriver, A. L. Beaudet, W. S. Sly, and D. Valle, Eds.), pp. 749–794. McGraw Hill, New York, NY.
2. Boguski, M. S., Lowe, T. M., and Tolstoshev, C. M. (1993). dbEST—database for “expressed sequence tags”. Nat. Genet. 4: 332–333.
3. Goffeau, A., et al. (1996). Life with 6000 genes. Science 274: 546.
4. The C. elegans Sequencing Consortium. (1998). Genome sequence of the nematode C. elegans: a platform for investigating biology. Science 282: 2012–2018.
5. Hattori, M., et al. (2000). The DNA sequence of human chromosome 21. The chromosome 21 mapping and sequencing consortium. Nature 405: 311–319.
6. Dunham, I., et al. (1999). The DNA sequence of human chromosome 22. Nature 402: 489–495.
7. Venter, J. C., et al. (2001). The sequence of the human genome. Science 291: 1304–1351.
8. Lander, E. S., et al. (2001). Initial sequencing and analysis of the human genome. Nature 409: 860–921.
9. Antonarakis, S. E. (2001). Chromosome 21: from sequence to applications. Curr. Opin. Genet. Dev. 11: 241–246.
10. Gardiner, K., and Davisson, M. T. (2000). The sequence of human chromosome 21 and implications for research into Down syndrome. Genome Biol. 1: 1–9.
11. Deutsch, S., Iseli, C., Bucher, P., Antonarakis, S. E., and Scott, H. S. (2001). A cSNP map and database for human chromosome 21. Genome Res. 11: 300–307.
12. Shibuya, H., et al. (1996). TAB1: an activator of the TAK1 MAPKKK in TGF-β signal transduction. Science 272: 1179–1182.
13. Ninomiya-Tsuji, J., et al. (1999). The kinase TAK1 can activate the NIK-IκB as well as the MAP kinase cascade in the IL-1 signalling pathway. Nature 398: 252–256.
14. Takaesu, G., et al. (2000). TAB2, a novel adaptor protein, mediates activation of TAK1 MAPKKK by linking TAK1 to TRAF6 in the IL-1 signal transduction pathway. Mol. Cell 5: 649–658.
15. Corcoran, M. M., et al. (1999). Dysregulation of cyclin dependent kinase 6 expression in splenic marginal zone lymphoma through chromosome 7q translocations. Oncogene 18: 6271–6277.
16. Bliskovski, V. V., Kirillov, A. V., Zakhar’ev, V. M., and Chumankov, I. M. (1992). Gen son cheloveka: bol’sho i maly transkripty soderzhat raznye 5′-kontsevye posledovatel’nosti. Mol. Biol. (Mosk) 26: 807–812.
17. Mattioni, T., et al. (1992). A cDNA clone for a novel nuclear protein with DNA binding activity. Chromosoma 101: 618–624.
18. Kikuno, R., et al. (1999). Prediction of the coding sequences of unidentified human genes. XIV. The complete sequences of 100 new cDNA clones from brain which code for large proteins in vitro. DNA Res. 6: 197–205.
19. Burd, C. G., and Dreyfuss, G. (1994). Conserved structures and diversity of functions of RNA-binding proteins. Science 265: 615–621.
20. Ryter, J. M., and Schultz, S. C. (1998). Molecular basis of double-stranded RNA-protein interactions: structure of a dsRNA-binding domain complexed with dsRNA. EMBO J. 17: 7505–7513.
21. Drabkin, H. A., et al. (1999). DEF-3(g16/NY-LU-12), an RNA binding protein from the 3p21.3 homozygous deletion region in SCLC. Oncogene 18: 2589–2597.
22. Aravind, L., and Koonin, E. V. (1999). G-patch: a new conserved domain in eukaryotic RNA-processing proteins and type D retroviral polyproteins. Trends Biochem. Sci. 24: 342–344.
23. Fraser, A. G., et al. (2000). Functional genomic analysis of C. elegans chromosome I by systematic RNA interference. Nature 408: 325–330.
24. Khachigian, L. M., et al. (1999). GC factor 2 represses platelet-derived growth factor A-chain gene transcription and is itself induced by arterial injury. Circ. Res. 84: 1258–1267.
25. Borowsky, M. L., and Hynes, R. O. (1998). Layilin, a novel talin-binding transmembrane protein homologous with C-type lectins, is localized in membrane ruffles. J. Cell. Biol. 143: 429–442.
26. Bono, P., Rubin, K., Higgins, J. M., and Hynes, R. O. (2001). Layilin, a novel integral membrane protein, is a hyaluronan receptor. Mol. Biol. Cell. 12: 891–900.
27. Pletcher, M. T., Wiltshire, T., Cabin, D. E., Villanueva, M., and Reeves, R. H. (2001). Use of comparative physical and sequence mapping to annotate mouse chromosome 16 and human chromosome 21. Genomics 74: 45–54.
28. de Souza, S. J., et al. (2000). Identification of human chromosome 22 transcribed sequences with ORF expressed sequence tags. Proc. Natl. Acad. Sci. USA 97: 12690–12693.
29. Michaud, J., et al. (2000). Isolation and characterization of a human chromosome 21q22.3 gene (WDR4) and its mouse homologue that code for a WD-repeat protein. Genomics 68: 71–79.
30. Reymond, A., et al. (2001). The tripartite motif family identifies cell compartments. EMBO J. 20: 2140–2151.

Sequence data from this article have been deposited with the DDBJ/EMBL/GenBank Data Libraries under accession numbers AF358257-AF358258; AF360358; AF363446-447; AF375989; AF380178-184; AF391112-115; AJ409094; AY033899-908; AY035381-383; AY037804; AY039243-244; AY040086-090; AY040873-876.
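The accession list above uses a compact range notation (for example AF380178-184). For readers who want to retrieve individual records, the following Python sketch shows one way to expand such a range into explicit accession numbers. It assumes that an abbreviated end value replaces the trailing digits of the start value, which is how the notation appears to be used here; the function name is hypothetical and nothing in the code queries GenBank.

```python
import re

# Accession = uppercase letter prefix followed by digits (e.g. "AF380178").
_ACCESSION = re.compile(r"^([A-Z]+)(\d+)$")


def expand_accessions(token: str) -> list[str]:
    """Expand 'AF380178-184' or 'AF358257-AF358258' into a list of accessions.

    A token without a hyphen is returned unchanged as a one-element list.
    Assumption: an abbreviated end ('184') replaces the last digits of the start.
    """
    token = token.strip()
    if "-" not in token:
        return [token]

    first, last = token.split("-", 1)
    match = _ACCESSION.match(first)
    if match is None:
        raise ValueError(f"unrecognized accession: {first!r}")
    prefix, start_digits = match.groups()

    # Allow the end of the range to repeat the prefix in full.
    if last.startswith(prefix):
        last = last[len(prefix):]
    if not last.isdigit() or len(last) > len(start_digits):
        raise ValueError(f"unrecognized range end: {last!r}")

    # Splice the abbreviated end onto the leading digits of the start.
    end_digits = start_digits[: len(start_digits) - len(last)] + last
    start, end = int(start_digits), int(end_digits)
    if end < start:
        raise ValueError(f"descending range: {token!r}")

    width = len(start_digits)  # keep leading zeros, e.g. AY033899
    return [f"{prefix}{n:0{width}d}" for n in range(start, end + 1)]


if __name__ == "__main__":
    for tok in ("AF358257-AF358258", "AF360358", "AF380178-184"):
        print(tok, "->", ", ".join(expand_accessions(tok)))
```

Preserving the digit width matters for entries such as AY033899-908, where the numeric part starts with a zero.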


GENOMICS Vol. 78, Numbers 1-2, November 2001 Copyright © 2001 by Academic Press. All rights of reproduction in any form reserved.