
doi:10.1006/geno.2002.6795, available online at http://www.idealibrary.com on IDEAL

Article

Novel Vertebrate Genes and Putative Regulatory Elements Identified at Kidney Disease and NR2E1/fierce Loci

Brett S. Abrahams,1,2 Grace M. Mak,1 Melissa L. Berry,3 Diana L. Palmquist,1 Jennifer R. Saionz,3 Alice Tay,4 Y. H. Tan,4 Sydney Brenner,4 Elizabeth M. Simpson,1,*,† and Byrappa Venkatesh4,*

1 Centre for Molecular Medicine & Therapeutics, British Columbia Research Institute for Children’s and Women’s Health, and Department of Medical Genetics, University of British Columbia, 980 West 28th Avenue, Vancouver, British Columbia, V5Z 4H4, Canada
2 Graduate Program in Neuroscience, University of British Columbia, British Columbia, Canada
3 The Jackson Laboratory, 600 Main Street, Bar Harbor, Maine 04609, USA
4 Institute of Molecular and Cell Biology, Singapore 117609

*These authors contributed equally to this work.

† To whom correspondence and reprint requests should be addressed. Fax: (604) 875-3819. E-mail: [email protected].

Fierce (frc) mice are deleted for nuclear receptor 2e1 (Nr2e1), and exhibit cerebral hypoplasia, blindness, and extreme aggression. To characterize the Nr2e1 locus, which may also contain the mouse kidney disease (kd) allele, we compared sequence from human, mouse, and the puffer fish Fugu rubripes. We identified a novel gene, c222389, containing conserved elements in noncoding regions. We also discovered a novel vertebrate gene conserved across its length in prokaryotes and invertebrates. Based on a dramatic upregulation in lactating breast, we named this gene lactation elevated-1 (LACE1). Two separate 100-bp elements within the first NR2E1 intron were virtually identical among the three species, despite an estimated 450 million years of divergent evolution. These elements represent strong candidates for functional NR2E1 regulatory elements in vertebrates. A high degree of conservation across NR2E1 combined with a lack of interspersed repeats suggests that an array of regulatory elements embedded within the gene is required for proper gene expression.

Key Words: brain, comparative sequence analysis, eye, HSPC019, SNX3

INTRODUCTION

Although it has been known for some time that the orphan nuclear receptor tailless is essential for embryonic development in Drosophila melanogaster [1], much less is known about the homologous vertebrate gene, nuclear receptor 2E1 (NR2E1). Previous work has shown that fierce (frc) mice, homozygous for a null allele at Nr2e1 [2], and strains targeted for Nr2e1 [3,4] show a developmental brain phenotype associated with cerebral hypoplasia, blindness, and extreme aggression. Despite these observations, the function of NR2E1 in vertebrates remains unclear, and nothing is known about its transcriptional regulation.

Comparative sequence analysis has been shown to be a powerful method to effectively reduce a large amount of genomic sequence to a core set of functionally important regions amenable to experimental inquiry [5,6]. The puffer fish Fugu rubripes is especially attractive as a “model genome,” as it is only one-eighth the size of the human genome yet contains the full complement of vertebrate genes [7]. Because the intergenic and intronic sequences in F. rubripes are highly compressed and repetitive elements are reduced relative to other vertebrates [8,9], it is relatively easy to identify and characterize conserved regulatory elements in noncoding sequences. Moreover, because mammals and F. rubripes have been evolving separately for an estimated 450 million years, conservation of sequence is likely to be the result of functional constraints. As such, high-stringency comparisons are ensured.

Here, we compared genomic sequences from human, mouse, and F. rubripes to learn more about NR2E1. The locus is also of interest because it may contain the allele responsible for the progressive and fatal nephritis observed in mice homozygous for the kidney disease (kd) allele [10]. In addition to NR2E1 and its putative regulatory regions, two known genes (HSPC019 and sorting nexin-3, SNX3) and two novel genes identified here were studied. A presumed noncoding RNA represented by clone 222389 (c222389) was seen to contain regions conserved among vertebrates. A novel vertebrate gene conserved in both prokaryotes and invertebrates was found to be differentially expressed in lactating breast and, based on this finding, it was named lactation elevated-1 (LACE1). Sequence near NR2E1 and conserved between human and mouse was also present proximal to other genes involved in brain development, defining putative regulators of brain transcription (pRBTs). Finally, two highly conserved noncoding elements (CEs), CE-17 and CE-19 within intron 1 of NR2E1, are common to all three species and thus likely important for NR2E1 regulation.

GENOMICS Vol. 80, Number 1, July 2002 Copyright © 2002 Elsevier Science (USA). All rights reserved. 0888-7543/02 $35.00

FIG. 1. Gene order and orientation at the NR2E1 locus are conserved in vertebrates, despite almost 10-fold compression in F. rubripes. The schematic represents an overview of the locus in human, mouse, and F. rubripes, with the inset at bottom center illustrating F. rubripes sequence to mammalian scale. Arrowheads preceded by a dotted line indicate genes that extend beyond the region depicted. The inset at bottom left illustrates the total repeat content in each of the three species, as well as the relative contribution of LINEs, SINEs, and other repeats.

RESULTS AND DISCUSSION

Conservation of Gene Order and Orientation Despite Tenfold Compression in F. rubripes

The organization of the locus is identical among human, mouse, and F. rubripes, with conservation of both gene order and orientation (Fig. 1). Although the mouse sequence reported here does not extend to Hspc019, a recent report has shown that this gene maps to the Nr2e1 locus between D10Mit55 and D10Mit255 in mouse (see UniSTS 703425 [11]), supporting conservation of synteny in all three species. As reported previously for other F. rubripes loci [8], the F. rubripes NR2E1 locus is compact relative to its mammalian counterparts. The 252.4 kb in the human sequence between the terminal HSPC019 exon and the first LACE1 exon (GenBank acc. nos. BK000461 and AF520418, respectively) spans only 27.9 kb in F. rubripes (AF461063). The compression in F. rubripes is due to shortened introns, highly reduced intergenic regions, and a general reduction in repetitive elements relative to other vertebrates [7,12]. Availability of mouse genomic sequence for the region will permit amplification and sequencing of individual exons, which in turn may help identify the kd allele or narrow down the critical interval containing the mutation.

Proportion of Interspersed Repeat Sequence Reduced in Mouse Relative to Human

Large-scale human–mouse comparisons have found a reduction in the proportion of interspersed repeats in the mouse genome relative to human [13,14]. To determine whether this was true at NR2E1, we used RepeatMasker (http://repeatmasker.genome.washington.edu/cgi-bin/RepeatMasker) in conjunction with species-specific Repbase libraries [15] to mask sequence from human (212.8 kb) and mouse (177.7 kb, GenBank acc. no. AF520420) between CE-5A (5′ of the second c222389 exon) and exon 3 of LACE1. We determined that interspersed repeats were present at significantly greater levels in human (48.0%) compared with mouse (30.5%; χ², P < 0.0001; Fig. 1). The numbers of LINEs and SINEs were similar, but elements were longer in human (data not shown).

Conservation Clusters at NR2E1 and Contains Elements Homologous to F. rubripes Sequence

We identified 47 conserved elements present in human and mouse (CE-As; Fig. 2). Neither ESTs nor gene prediction software (NIX, http://www.hgmp.mrc.ac.uk/NIX/; GenomePipeLine, http://compbio.ornl.gov/tools/pipeline/) provided strong evidence that any were part of known or novel genes. The average distance between CE-As (one per 3.9 kb of human sequence) was similar to a multi-locus estimate of one per 6.9 kb we determined by pooling data from several studies [5,14,16,17]. CE-As were not distributed evenly across the region, but rather were clustered around NR2E1 (Fig. 3). Relative conservation was ascertained by dividing the locus into segments on the basis of intra- and intergenic sequence, and numbering adjacent segments sequentially from the 5′ end (Fig. 3). Although regions V (immediately 5′ to NR2E1) and VI (flanked by NR2E1 exons) represent less than 14% of the sequence we compared, they contain 48% of CE-As at the locus (χ², P < 0.0001). Furthermore, less than 12% of the sequence in V or VI contains interspersed repeats in either mammal, as compared with the species-specific averages of 48% and 30.5% for human and mouse, respectively (χ², P < 0.0001). As argued for the four HOX gene clusters [18], the high degree of conservation at NR2E1 in conjunction with an apparent lack of tolerance for interspersed repeats suggests the presence of an embedded array of cis regulatory elements required for proper gene function.

Seven elements were common to human and F. rubripes (CE-Bs), and were unlike any annotated sequence within GenBank (Fig. 2). Although CE-1B through -4B are beyond the 5′ end of the mouse sequence we generated, CE-5B through -19B overlap with CE-As and support the hypothesis that elements conserved between human and F. rubripes are conserved among vertebrates. The lack of any conserved reading frame within CE-5B, -17B, and -19B makes these regions of particular interest, as it suggests that conservation across these elements may have been maintained in order to preserve regulatory elements required for gene expression.

FIG. 2. Conserved elements (CEs) can be identified by human–mouse–F. rubripes sequence comparison. Coordinates for both CE-As identified by human–mouse comparison (minimum of 70% identity across at least 100 ungapped bp) and CE-Bs identified by human–F. rubripes comparison (minimum of 65% identity across at least 100 bp, with individual gaps < 20 bp) are given. The mouse sequence reported here does not cover any of region I, II, or III. “Begin” and “End” points refer to the human sequence examined, with 1 equivalent to bp 403,724 of NT_026302.4. Gaps permitted within CE-Bs resulted in slightly different lengths between human and F. rubripes. Lengths reported are from human.

FIG. 3. Conserved sequence elements are not distributed evenly across the region but cluster proximal to and within the mammalian NR2E1 locus. Regions, defined by gene position, are as follows: IV, within c222389; V, between c222389 and NR2E1; VI, within NR2E1; VII, between NR2E1 and SNX3; VIII, within SNX3; IX, between SNX3 and LACE1; X, within LACE1. Regions I, II, and III do not appear, as these regions were not sequenced in mouse.

Novel Coding Exons and Conserved Domains in the Human Gene HSPC019

Comparison of the mouse Hspc019 cDNA (AK004546 [11]), a human EST (AV728049), and the human genomic sequence for the locus (NT_026302.4) revealed a frameshift in the human HSPC019 cDNA (AF077205.1 [19]). After correcting this error, we were able to identify two additional human exons and extend the protein by 208 amino acids at its amino terminus (GenBank acc. no. BK000461). Although the human EST (AV728049) does not contain a start codon, examination of genomic DNA shows an open reading frame at its 5′ end. This open reading frame is not, however, conserved in either mouse or F. rubripes. Using BLAST to search the non-redundant (NR) protein database for sequences similar to human HSPC019, we identified both the CG14969 gene product (AAF47763.11) from Drosophila melanogaster and the F42A8.3 gene product (Q09322) from Caenorhabditis elegans. Although database searches (CDD, http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml; PFAM, http://pfam.wustl.edu) failed to identify any known protein domains, pattern recognition software (MEME, http://meme.sdsc.edu/meme/website/) led us to an 18-residue motif (CnpDnp..MN.T..W...npC) expected to occur by chance at a rate of 1 in 1.25e–20. A second region N-terminal to a predicted outside-to-inside transmembrane domain [20] in each of human, mouse, and F. rubripes was identical (FYLSSFLHSEQKKRKLI) among vertebrates. Both regions likely represent novel functional domains described here for the first time.

Eye-Derived c222389 Contains Conserved Noncoding Regions

Although no computationally predicted genes between HSPC019 and NR2E1 from human and F. rubripes overlapped, end-reads from IMAGE clone 222389 (acc. nos. H86089 and H86429) mapped to the region and contained sequence conserved among all three species. Attempts to amplify the transcript from oligo(dT)-primed cDNA prepared from each of human retina, mouse eye, and F. rubripes eye failed (data not shown), but clones from separate human cDNA libraries (GenBank acc. nos. AA020885, AA020773, and AA903843) support the hypothesis that the region within the clone is indeed transcribed. Sequencing of c222389 (GenBank acc. no. AF520419) showed 1) the cDNA insert to be 2 kb, 2) canonical splicing relative to genomic DNA, and 3) conservation between each of human, mouse, and F. rubripes across two separate regions.

Percent identities across these regions were high between human and mouse (88% and 96% over > 100 ungapped bp) and only slightly lower between human and F. rubripes (78% and 95% over 77 and 58 ungapped bp). We found 17 short (6 to 198 bp), ATG-containing open reading frames within the human clone, but none of these were similar to open reading frames in the corresponding mouse or F. rubripes sequence. Two additional open reading frames (60 and 66 bp) are conserved among all three species but do not contain a start methionine. It is not known if the 2-kb insert we sequenced represents a full-length clone. The c222389 gene cannot be the 5′-UTR from NR2E1 given published data regarding the latter’s transcript length and transcriptional start [21]. Because it is not known whether the cDNA is full length, c222389 may be the 3′-UTR from a novel protein-coding gene, but the high degree of conservation makes this unlikely. We hypothesize that c222389 represents some or all of a noncoding RNA expressed transiently in response to particular, and as yet unknown, physiological stimuli. The conserved regions within the cDNA define noncoding elements likely to be of functional importance.

FIG. 4. Alignment of vertebrate NR2E1 protein homologs shows strong conservation across species. Arrowheads connected by vertical lines indicate conserved exon boundaries observed in Homo sapiens (human), Mus musculus (mouse), and Fugu rubripes (pufferfish). Use of a distinct intron 1 splice acceptor site in fish, F. rubripes and Oryzias latipes (Japanese Medaka), results in the inclusion of 11 additional amino acids. Gallus gallus (chicken) and Xenopus laevis (African clawed frog) are included for comparison. DNA and ligand binding domains are shaded dark and light, respectively.

NR2E1 Protein and Transcript Distribution Are Conserved among Vertebrates

F. rubripes NR2E1 is 87% identical to both mouse and human across its length (Fig. 4). Differences cluster within the protein’s hinge region, located between the molecule’s DNA- and ligand-binding domains. NR2E1 in both F. rubripes and Oryzias latipes (acc. no. CAB38085.1) contains 11 additional amino acid residues (Fig. 4). Although it is not clear whether these residues are the result of recruitment in the fish lineage or loss in the mammalian lineage, this sequence seems to be unique to fish. Following identification of F. rubripes NR2E1, we sought to determine whether transcript distribution was conserved among vertebrates, as this would support the hypothesis that regulatory mechanisms were similarly conserved. We examined 13 F. rubripes tissues by RT-PCR (Fig. 5) and, as observed previously for mammals and chick [3,22], found NR2E1 at high levels in brain and eye alone. These results support conserved regulatory mechanisms at NR2E1 among human, mouse, and F. rubripes.

FIG. 5. NR2E1 in F. rubripes is strongly transcribed from brain and eye alone. RT-PCR products were transferred to a nylon membrane and probed with a cDNA for F. rubripes NR2E1. To control for cDNA quality a F. rubripes actin fragment was amplified and analyzed as above.

Discovery of Potential Vertebrate Conserved Regulatory Elements for NR2E1

Positioning of c222389 was informative, as it limited the distance between NR2E1 and its nearest 5′ neighbor to just over 8.0 kb in human and mouse, and only 3.5 kb in F. rubripes. This information is important because it likely defines the basal NR2E1 promoter. We used TESS (http://www.cbil.upenn.edu/tess/index.html) to examine conserved elements proximal to the transcriptional start site (CE-13A, -14A, -17A, and -19A, and the 5′-UTR of exon 1; Figs. 2 and 6A) for DNA motifs known to bind vertebrate proteins. Numerous potential binding sites conserved in position and orientation were observed between mouse and human. A much smaller number of binding sites were conserved when all three species were considered (CE-17 and CE-19; Fig. 6B). CE-17 contains three binding sites (each with identical position and orientation) for the paired related homeobox gene-2 (Prrx2). Because Prrx2 is a member of the paired-type homeo class of factors known to interact with tailless, the Drosophila homolog of NR2E1, the presence of potential binding sites in a conserved region within NR2E1 is of interest. Although the Prrx2 transcript has not been found in mammalian brain [23,24], the gene may be expressed in eye. As well, other brain-expressed members of the Prrx family may bind to the Prrx2 sites we identified. Comparison of CE-19B in human, mouse, and F. rubripes identified conserved sites for 14 distinct factors.

BLAST searches against the human genome indicated that sequences similar to portions of CE-10A (located upstream of NR2E1; Fig. 6A) were present at multiple loci, proximal to genes implicated in brain development. Based on this finding, we have identified these elements as putative regulators of brain transcripts (RBTs). Putative RBTs were observed within multiple human genes: intron 1 of neurexophilin-1 (GenBank acc. no. AC004613.1), intron 7 of neurexin-3α (GenBank acc. no. AC012099.4), and flanking exon 1 of FLJ11598 (GenBank acc. nos. AC018900.8 and AK021660), an uncharacterized gene similar to the mouse semaphorin VIa gene (GenBank acc. no. AF030430). Although little is known about the neuron-specific α-neurexins, the β-isoforms are involved in synapse formation [25]. Neurexophilin-1 binds neurexin-3α and is thought to represent a neurexin ligand [26].
Less is known about FLJ11598, but based on its presence in brain (Unigene Hs.191098) and similarity to semaphorin VIa, it may act as an axonal guidance molecule [27]. The localization of these putative RBTs suggests that these sequences may define novel regulatory elements important for a subset of mammalian brain-expressed genes. Examination of CE-10A for transcription factor binding sites with TESS revealed multiple conserved GATA binding protein sites. Given the periventricular localization of both Gata2 and Gata3 transcripts and their proposed role in early neural differentiation [28], the presence of these binding sites suggests that neurexin-3, neurexophilin-1, FLJ11598, and NR2E1 are regulated in concert by GATA family members. Furthermore, this region contained a pair of conserved nonoverlapping sites for Prrx2, lending support for its possible role in the regulation of NR2E1.

Limited Conservation at SNX3 outside of Protein-Coding Regions

Like NR2E1, the SNX3 protein is highly conserved, with almost 99% identity between the human (AF034546.1) and mouse (NM_017472.1), and 84% identity between mammals and F. rubripes (AF461063). In contrast to NR2E1, however, intronic conservation was limited to a small cluster of CE-As within intron 1 (Fig. 2). Although three of these elements, CE-35A, -37A, and -38A, matched ESTs from colon (AW841298 and AW848035), none was homologous to any sequence within F. rubripes.

FIG. 6. The NR2E1 locus is densely packed with conserved elements, and has few repeats, suggesting a series of embedded regulatory elements. (A) We compared human sequence from the NR2E1 locus (bp 575,726–615,725 of NT_026302.4) with the corresponding region from mouse (bp 21,637–58,750 of bEMS4) with PipMaker. Short horizontal black lines represent percent identity between the two masked sequences. Blue, red, and green rectangles represent, respectively, known coding exons (e), CE-As (minimum of 70% identity between human and mouse across at least 100 ungapped bp), and CE-As seen to overlap with CE-Bs (minimum of 65% identity between human and F. rubripes across at least 100 bp, with individual gaps < 20 bp). Repeat features and CpG islands (as per key) correspond to the human sequence. (B) CE-17 and 19, in intron 1 of NR2E1, are conserved in vertebrates. PRRX2 binding sites in CE-17 are indicated with dotted lines; red and blue correspond to the + and – strands, respectively. It is likely that the regions defined as common to all three species are important for NR2E1 regulation.

Identification of LACE1, a Novel Vertebrate Gene Upregulated during Lactation

ESTs from human retinoblastoma (BE780155) and mouse mammary tumor (BF179793) mapped to the region, showed splicing relative to genomic DNA, and overlapped with sequence conserved between both mammals and F. rubripes. Sequencing of the corresponding IMAGE clones (GenBank acc. nos. AF520418 and AF520417, respectively) identified a conserved open reading frame. A second mouse EST (AI156858) was subsequently found to contain the start methionine absent from clone 4037233 as well as a possible alternative transcript. 5′-RACE and RT-PCR on F. rubripes cDNA identified the F. rubripes transcription start site as well as exons 1 through 7, all of which are present on cosmid 48H12. Alignment of the human cDNA against the genomic locus (NT_026302.4) uncovered 13 exons across 228 kb of sequence. Exons 12 and 13 (bp 1406–2270) were subsequently observed to have been predicted by Ensembl v 1.0 (http://www.ensembl.org/) and are represented by ENSG00000112341. Based on the differential transcript distribution we observed (see below), we named this novel gene lactation elevated-1 (LACE1).


FIG. 7. LACE1 homologs in vertebrates, invertebrates, fungi, and prokaryotes share a common five-domain structure. The schematic depicts protein domains conserved throughout evolution; color intensity indicates percent identity to human as per key at bottom. Species names (followed by GenBank acc. nos.) and predicted molecular weights (in kilodaltons, kDa) appear above each protein. Below the H. sapiens, D. melanogaster, S. cerevisiae, and E. coli proteins, the amino acid numbers comprising each domain are indicated.

LACE1 Homologs Share a Five-Domain Structure

BLAST searches with LACE1 against NR did not identify any similar vertebrate proteins, but did identify non-vertebrate proteins from fly, yeast, and bacteria. Each protein contains an ATP/GTP binding P-loop [29], shares a five-domain structure, and is predicted to be an ATPase (Fig. 7). Percent identity ranged from 28% (Saccharomyces cerevisiae versus Vibrio cholerae) to 88.9% (human versus mouse), with each of the non-vertebrate homologs showing between 30% and 50% identity to the human protein. Each homolog is more similar to the human LACE1 protein than to other ATPases from its own species. Only the Saccharomyces cerevisiae homolog AFG1/YEL052 has been studied experimentally, but two separate reports using distinct approaches have demonstrated that disruption of this ORF results in a reduced growth rate relative to wild-type strains [30,31].
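As an illustration of what an ATP/GTP-binding P-loop looks like at the sequence level, the following minimal sketch scans a protein for the widely used Walker A consensus [AG]-x(4)-G-K-[ST]. The text does not state how the P-loop in LACE1 was detected, so the use of this pattern is an assumption, and the toy sequence below is hypothetical rather than taken from any LACE1 homolog.

```python
import re

# Walker A / P-loop consensus [AG]-x(4)-G-K-[ST], written as a regular expression.
# Assumption: the source does not say which pattern or tool was used.
P_LOOP = re.compile(r"[AG].{4}GK[ST]")


def find_p_loops(protein: str) -> list[tuple[int, str]]:
    """Return (1-based start position, matched motif) for each non-overlapping hit."""
    return [(m.start() + 1, m.group()) for m in P_LOOP.finditer(protein.upper())]


# Hypothetical toy peptide, used only to show the call.
toy_peptide = "MAGPPGSGKSTL"
hits = find_p_loops(toy_peptide)  # [(3, 'GPPGSGKS')]
```

A match to this short consensus is only suggestive; the ATPase prediction in the text rests on the shared five-domain structure as well.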


Lace1 Is Elevated during Lactation and Alternatively Spliced

Relative transcript levels in multiple mouse tissues were assessed by semi-quantitative RT-PCR on serially diluted cDNA templates. Although Lace1 was observed in each of the 24 tissues examined, transcript levels varied by as much as 100-fold (data not shown). Highest levels were seen in heart, kidney, and lactating breast (Fig. 8). Lace1 was present at reduced levels in virgin, pregnant, and involuting breast, suggesting that it may be associated with physiological changes that occur in lactation. We examined the 2.5 kb surrounding human and mouse LACE1 exon 1 for progesterone, estrogen, and glucocorticoid receptor binding sites and identified a pair of conserved estrogen receptor binding sites. These data suggest that LACE1 may be hormonally regulated. RT-PCR also provided empirical evidence for the alternate (short) Lace1 transcript, predicted by comparison of ESTs to lack exon 3. The absence of exon 3 (57 bp) does not disrupt the Lace1 open reading frame but does shorten the protein by 19 amino acids. Fourteen of these residues are within the second conserved protein domain (positions 126–139, Fig. 7) we identified, raising the hypothesis that the structural differences between the alternative transcripts underlie functional differences between them. The short isoform is present at low levels in most tissues, but is upregulated during embryonic development and appears at levels comparable to the longer transcript in adult brain. The potential for equivalent alternative transcripts exists in both human and F. rubripes.

The results presented here show that inter-species comparison is an effective strategy for discovering new genes and conserved regions within them likely to be of functional significance. Although we generated and subsequently examined 187 kb of mouse sequence in the course of this study, only 4.3% of this met our criteria for a CE-A, limiting the regions likely to be of functional significance by almost 20-fold. Without this comparative approach more than 100 kb of sequence would have required investigation by standard experimental approaches, even after discarding the sequence defined by RepeatMasker as repetitive. F. rubripes provided an additional degree of stringency. Because only 0.45% of human sequence met the criteria for a CE-B, sequence of interest was limited by an additional 10-fold, increasing ultimate specificity 200 times. Moreover, examination of the locus in three distinct species guarded against false positives that may have arisen in the study of only two. Given the cost and time associated with in vivo experiments, inter-species comparison can assist in the quick determination of the experimental targets most likely to be of biological significance.
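The fold reductions quoted above follow directly from the two fractions reported (4.3% for CE-As and 0.45% for CE-Bs). The short sketch below reproduces that arithmetic; the variable and function names are illustrative only, and the computed values (roughly 23-fold, 9.6-fold, and 222-fold) are what the text rounds to 20-fold, 10-fold, and 200 times.

```python
# Fractions of sequence meeting each criterion, as reported in the text.
CE_A_FRACTION = 0.043    # 4.3% of the mouse sequence met the CE-A criteria
CE_B_FRACTION = 0.0045   # 0.45% of the human sequence met the CE-B criteria
MOUSE_SEQUENCE_KB = 187  # mouse sequence examined


def fold_reduction(fraction: float) -> float:
    """How many-fold the search space shrinks when only `fraction` of it is kept."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    return 1.0 / fraction


ce_a_fold = fold_reduction(CE_A_FRACTION)          # about 23-fold ("almost 20-fold")
ce_b_extra_fold = CE_A_FRACTION / CE_B_FRACTION    # about 9.6-fold ("additional 10-fold")
total_fold = fold_reduction(CE_B_FRACTION)         # about 222-fold ("200 times")
ce_a_kb = MOUSE_SEQUENCE_KB * CE_A_FRACTION        # about 8 kb of mouse sequence in CE-As

print(f"CE-A: {ce_a_fold:.1f}-fold; CE-B adds {ce_b_extra_fold:.1f}-fold; "
      f"overall {total_fold:.0f}-fold; CE-A sequence ~{ce_a_kb:.1f} kb")
```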


The extent of noncoding conservation within a gene seems to be unrelated to conservation within its coding regions, a finding highlighted here by disparate results obtained for NR2E1 and SNX3. Although the two proteins are similarly conserved, only NR2E1 showed any conservation within F. rubripes noncoding regions. Conservation at some loci but not others could simply reflect differences in evolutionary age, but an interesting alternative is that evolution maintains conservation at a subset of loci in order to preserve regulatory elements distributed within them.

Colony stimulating factor-1 [32], the osteoprotegerin ligand (tumor necrosis factor ligand superfamily, member 11) [33], patched-1 [34], and prolactin [35] are all differentially regulated during lactation, and essential for normal lactation [36,37]. Given that LACE1 is similarly regulated, it too may prove essential for normal lactation. Whether or not this is true for LACE1, the fact that the protein is present across multiple kingdoms and at the same time distinct from the most closely related ATPases suggests a unique role for this novel gene.

The sequence data generated here will provide a valuable resource for study of the locus, and may prove important in the determination of the gene responsible for the fatal nephritis observed in mice homozygous for the kd allele. The identification of conserved elements between human, mouse, and F. rubripes provides an excellent starting point for the determination of the full range of sequence elements required for correct temporal–spatial gene expression of NR2E1.

FIG. 8. In mouse, Lace1 transcript levels in lactating breast, heart, and kidney are as much as 100 times greater than in other tissues, as demonstrated by semi-quantitative RT-PCR performed in duplicate at each of 1, 10, 100, and 1000 pg templates. (A) At 1000 pg, two distinct Lace1 isoforms are visible in a subset of tissues. (B) RT-PCR with β-actin primers was performed under identical conditions to control for template amount. M, marker.

MATERIALS AND METHODS

Isolation of mouse and F. rubripes clones for the NR2E1 locus. We used oEMS296, 5′-CTCCCAGCAATCTAGTTTCCC-3′, and oEMS298, 5′-CTCTAGCAAAACTGCAGCTGC-3′, against mouse Nr2e1 (S77482) to identify bEMS4 (551D16) from the 129 mouse CITB BAC library [38]. We used degenerate primers oEMS1610 (FMSIKWA, 5′-TTYATGAGYATHAARTGGGC-3′) and oEMS1611 (QWAIPVD, 5′-TCNACNGGHATNGCCCAYTG-3′) against exons 5 and 6 of vertebrate NR2E1 homologs, respectively, to isolate a F. rubripes probe and screen a F. rubripes cosmid library (G. Elgar, UK-HGMP Resource Center). Two positive clones, 48H12 and 117N12, were determined by restriction mapping to be identical. We sequenced 48H12.


Mouse and F. rubripes sequence generation. We generated mouse (GenBank acc. no. AF520420) and F. rubripes (GenBank acc. no. AF461063) sequence by a combination of shotgun sequencing and primer walking, using automated DNA sequencers. Regions that could not be sequenced with Big Dye Terminator chemistry were resolved with dGTP Big Dye chemistry (Applied Biosystems, Foster City, CA). dGTP reads were always re-sequenced from the opposite direction with a Big Dye Terminator read. We confirmed BAC sequence assembly (Sequencher, Gene Codes, Ann Arbor, MI) by multiple long PCRs using the Expand Long PCR Template System (Roche Diagnostics, Laval, PQ). We generated 40 kb of contiguous sequence for the F. rubripes cosmid 48H12 from a total insert of 45 kb; the additional unsequenced material is 3′ to the sequence reported here. IMAGE clones were obtained from Incyte Genomics (Palo Alto, CA) and Research Genetics (Huntsville, AL) and sequenced as above.

Sequence analysis. We used a portion of NT_026302.4 (bp 403,724–843,699), made up of finished and unfinished sequence [18], to represent the human region. The interval was chosen on the basis of NT_007577.1, at which time it spanned bp 370,000–810,000. Sequence was masked for repeats with RepeatMasker, run in high sensitivity mode in conjunction with species-specific Repbase libraries ([15], Volume 6, Issue 3, Version 05152001). Masked sequence was compared against each of NR, dbEST, GSS, HTGS, and PAT with one or more of BLASTN, BLASTP, BLASTX, and TBLASTX [39]. Excluding expect thresholds, which were lowered to 0.1 (DNA) or 0.001 (Protein), NCBI defaults were used. We used BLAST2Sequences (http://www.ncbi.nlm.nih.gov/gorf/bl2.html) and PipMaker (http://bio.cse.psu.edu) for comparative analyses, ignoring elements not found in the same relative position in all species.
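To make the CE-A criterion given in the Fig. 2 legend concrete (a minimum of 70% identity across at least 100 ungapped bp), the following minimal sketch scans a pairwise alignment with a fixed window and merges overlapping qualifying windows. It is a simplified stand-in and not the procedure implemented by BLAST2Sequences or PipMaker; the function name, the input format (two equal-length aligned strings with "-" for gaps), and the toy sequences are assumptions made for illustration.

```python
def conserved_windows(seq_a: str, seq_b: str, window: int = 100,
                      min_identity: float = 0.70) -> list[tuple[int, int]]:
    """Return merged (start, end) alignment intervals, 0-based and end-exclusive,
    covered by gap-free windows whose identity is at least `min_identity`."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")

    # Prefix sums of matching columns and of gapped columns.
    matches, gaps = [0], [0]
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        is_gap = a == "-" or b == "-"
        matches.append(matches[-1] + (a == b and not is_gap))
        gaps.append(gaps[-1] + is_gap)

    regions: list[tuple[int, int]] = []
    for start in range(len(seq_a) - window + 1):
        end = start + window
        if gaps[end] - gaps[start]:
            continue  # the criterion requires an ungapped stretch
        if (matches[end] - matches[start]) / window >= min_identity:
            if regions and start <= regions[-1][1]:
                regions[-1] = (regions[-1][0], end)  # extend the previous region
            else:
                regions.append((start, end))
    return regions


# Toy alignment (hypothetical): 100 mismatched, 100 identical, 100 mismatched columns.
swap = str.maketrans("ACGT", "CATG")
toy_a = "ACGT" * 75
toy_b = toy_a[:100].translate(swap) + toy_a[100:200] + toy_a[200:].translate(swap)
print(conserved_windows(toy_a, toy_b))  # [(70, 230)]
```

Because each reported interval is the union of qualifying windows, its flanks can extend up to 30 columns into poorly conserved sequence; a stricter implementation would trim the ends.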
To identify recognized and novel protein domains, we used MaxHom within PredictProtein (http://www.embl-heidelberg.de/predictprotein/predictprotein.html), the CDD database (http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml), PFAM (http://pfam.wustl.edu), and MEME (http://meme.sdsc.edu/meme/website/). All χ² analyses were done on basepair counts with the threshold for significance set at P = 0.0001.

Exon–intron structure of F. rubripes genes and determination of transcript distribution. RNA was isolated from F. rubripes tissues using TRIzol (Invitrogen, Burlington, ON). RNA (1 µg) was used to synthesize 5′- and 3′-RACE-ready cDNA using the SMART RACE cDNA Amplification Kit (Clontech, Palo Alto, CA). Transcriptional start sites and exon–intron boundaries for F. rubripes HSPC019, NR2E1, SNX3, and LACE1 were determined by cloning and sequencing 5′- and 3′-RACE products obtained with 24-mers (HSPC019: HSF1, 5′-CAGAAAGAGCATGAAGCTGGACAC-3′, HSF2, 5′-TTGAAGCTTGTGCTCCAGAGTCTG-3′, HSR1, 5′-TGTGTGTATCGACATGGAGGATTC-3′, HSR2, 5′-ACTTGAGCAGCTTCCTTCACTCAG-3′; NR2E1: TLR1, 5′-TCAAACGGCACGCTCGGCACTG-3′, TLR2, 5′-GTTTCTGTGCGTCTTGTCGACC-3′, TLF1, 5′-GATGCCACAGAATTTGCCTGCCTG-3′, TLF2, 5′-AAATGCATCGTGACCTTCAAAGC-3′; SNX3: NXF1, 5′-TGGCAGGAACAGATATACCAC-3′, NXF2, 5′-GCTGAAGGAGTCTTGTGTGCAG-3′, NXR1, 5′-ACTTGTTGAGAAACTGCTCCAGAC-3′, NXR2, 5′-ACTGCCGGAACAGAGCTTTTCCTG-3′; LACE1: LACER1, 5′-CGTTGAAGTGGACTCTCTTCTTG-3′, LACER2, 5′-AACATGTCCATGAGCATGGTCTTC-3′, LACEF1, 5′-TATCGCCGGTTGCCGTGGAGATAG-3′, LACEF2, 5′-GCAACGAAACCTGCCTGCTCTGC-3′) used in conjunction with adapter primers in nested PCR.

For F. rubripes NR2E1 transcript analysis, 1/100 of the RACE-ready cDNA served as template for PCR with oEMS1612, 5′-CTCAACAAGTGGCTGTCTGGA-3′, and oEMS1613, 5′-GTTTCTGTGCGTCTTGTCGACC-3′, complementary to exons 1 and 3, respectively. PCR products (220 bp) were size fractionated and transferred to Hybond-N nylon membrane (Amersham Biosciences, Piscataway, NJ), then hybridized with a F. rubripes NR2E1 cDNA (pEMS753) to confirm their identity. To control for cDNA quality, oEMS1614 (5′-AACTGGGACGACATGGAGAA-3′) and oEMS1615 (5′-TTGAAGGTCTCAAACATGAT-3′) were used to amplify F. rubripes ACTIN and the product (152 bp) analyzed as above (data not shown). pBSfACT, a F. rubripes actin cDNA, was used as a probe.

Lace1 levels were assayed with the 24 tissue mouse Rapid Scan Panel (Origene, Rockville, MD). PCR (with 10% DMSO) was carried out in duplicate using oEMS1429, 5′-GAAAGCCTTGGCTGTTTGC-3′, and oEMS1390, 5′-GTGCACGTCCAGCATGAA-3′, against exons 2 and 4. PCR generated two bands (316 and 259 bp), one for each of the Lace1 transcripts. Semiquantitative RT-PCR was achieved for each of the 24 mouse tissues by carrying out duplicate reactions with serially diluted cDNA template at each of 1, 10, 100, or 1000 pg. To demonstrate equal cDNA levels between tissues, β-actin primers (Origene β-actin forward, 5′-GCATGGGTCAGAAGGAT-3′, and reverse, 5′-CCAATGGTGATGACCTG-3′) were used to amplify a 570-bp fragment from cDNAs on an identical plate prepared in parallel to the one used for Lace1.
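As a worked illustration of the χ² analyses on basepair counts mentioned under Sequence analysis, the sketch below runs a Pearson χ² test on a 2 × 2 table and applies the P = 0.0001 threshold. The table layout (two regions, basepairs in versus out of some category) and the counts are hypothetical, and the text does not say whether a continuity correction was applied; none is used here.

```python
import math

SIGNIFICANCE_THRESHOLD = 0.0001  # threshold stated in the text


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test (no continuity correction) for [[a, b], [c, d]].

    Returns (statistic, p_value). With 1 degree of freedom the upper-tail
    probability is P(X > x) = erfc(sqrt(x / 2)), so no external library
    is needed.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column must have a nonzero total")
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Hypothetical basepair counts: bases inside vs. outside a category
# (for example, repeat-masked bases) in two regions of 10,000 bp each.
stat, p_value = chi_square_2x2(600, 9400, 200, 9800)
is_significant = p_value < SIGNIFICANCE_THRESHOLD
```

With these illustrative counts the statistic is about 208, far beyond the threshold. Two regions with identical proportions give a statistic of 0 and P = 1.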

ACKNOWLEDGMENTS

We thank Xiao Hua Han (CMMT), Michele Karolak, and Tiffany Leidy (JAX) for BAC sequencing; Ruby Gill (Simpson Lab), B. H. Tay, and Diane Tan (IMCB) for technical assistance; Rosemary Oh and Ed Chan (Hayden Lab) for technical advice; Rebecca Devon (Hayden and Simpson Labs) and Farkhad Muratkhojaev (Simpson Lab) for insightful discussion; Tracey Weir (Simpson Lab) for both insightful discussion and invaluable assistance in manuscript preparation; Dora Pak (Simpson Lab) for administrative support; Miroslav Hatas (CMMT) for software support; and Sohrab Shah (Ouellette Lab) for indispensable advice regarding computational analyses. We thank the UK-HGMP resource center for providing the F. rubripes cosmids. B.V. is an adjunct staff of the Department of Pediatrics, Faculty of Medicine, National University of Singapore. This work was funded by the following grants: Canadian Institutes of Health Research Doctoral Research Award (B.S.A.), National Institutes of Health Mental Health 1RO1MH/HD57465 (E.M.S.), Canada Research Council Chair in Genetics & Behaviour 950-01-193 (E.M.S.), and a joint Industry Canada-Canadian Institutes of Health Research-Government of British Columbia-National Science and Technology Board of Singapore award.

RECEIVED FOR PUBLICATION OCTOBER 31, 2001; ACCEPTED APRIL 15, 2002.

REFERENCES

1. Pignoni, F., et al. (1990). The Drosophila gene tailless is expressed at the embryonic termini and is a member of the steroid receptor superfamily. Cell 62: 151–163.
2. Young, K. A., et al. (2002). Fierce: a new mouse deletion of Nr2e1; violent behaviour and ocular abnormalities are background-dependent. Behav. Brain. Res. 132: 145–158.
3. Yu, R. T., et al. (2000). The orphan nuclear receptor Tlx regulates Pax2 and is essential for vision. Proc. Natl. Acad. Sci. USA 97: 2621–2625.
4. Monaghan, A. P., et al. (1997). Defective limbic system in mice lacking the tailless gene. Nature 390: 515–517.
5. Loots, G. G., et al. (2000). Identification of a coordinate regulator of interleukins 4, 13, and 5 by cross-species sequence comparisons. Science 288: 136–140.
6. Venkatesh, B., Si-Hoe, S. L., Murphy, D., and Brenner, S. (1997). Transgenic rats reveal functional conservation of regulatory controls between the Fugu isotocin and rat oxytocin genes. Proc. Natl. Acad. Sci. USA 94: 12462–12466.
7. Brenner, S., et al. (1993). Characterization of the pufferfish (Fugu) genome as a compact model vertebrate genome. Nature 366: 265–268.
8. Venkatesh, B., Gilligan, P., and Brenner, S. (2000). Fugu: a compact vertebrate reference genome. FEBS Lett. 476: 3–7.
9. Elgar, G., et al. (1996). Small is beautiful: comparative genomics with the pufferfish (Fugu rubripes). Trends Genet. 12: 145–150.
10. Dell, K. M., Li, Y. X., Peng, M., Neilson, E. G., and Gasser, D. L. (2000). Localization of the mouse kidney disease (kd) gene to a YAC/BAC contig on Chromosome 10. Mamm. Genome 11: 967–971.
11. Kawai, J., et al. (2001). Functional annotation of a full-length mouse cDNA collection. The RIKEN Genome Exploration Research Group Phase II Team and the FANTOM Consortium. Functional annotation meeting. Nature 409: 685–690.
12. Elgar, G., et al. (1999). Generation and analysis of 25 Mb of genomic DNA from the pufferfish Fugu rubripes by sequence scanning. Genome Res. 9: 960–971.
13. Dehal, P., et al. (2001). Human chromosome 19 and related regions in mouse: conservative and lineage-specific evolution. Science 293: 104–111.
14. Onyango, P., et al. (2000). Sequence and comparative analysis of the mouse 1-megabase region orthologous to the human 11p15 imprinted domain. Genome Res. 10: 1697–1710.
15. Jurka, J. (2000). Repbase update: a database and an electronic journal of repetitive elements. Trends Genet. 16: 418–420.
16. Shiraishi, T., et al. (2001). Sequence conservation at human and mouse orthologous common fragile regions, FRA3B/FHIT and Fra14A2/Fhit. Proc. Natl. Acad. Sci. USA 98: 5722–5727.
17. Mallon, A. M., et al. (2000). Comparative genome sequence analysis of the Bpa/Str region in mouse and Man. Genome Res. 10: 758–775.
18. Lander, E. S., et al. (2001). Initial sequencing and analysis of the human genome. Nature 409: 860–921.
19. Zhang, Q. H., et al. (2000). Cloning and functional analysis of cDNAs with open reading frames for 300 previously undefined genes expressed in CD34+ hematopoietic stem/progenitor cells. Genome Res. 10: 1546–1560.
20. Hofmann, K., and Stoffel, W. (1993). TMbase—A database of membrane spanning proteins segments. Biol. Chem. 374: 166.
21. Monaghan, A. P., Grau, E., Bock, D., and Schütz, G. (1995). The mouse homolog of the orphan nuclear receptor tailless is expressed in the developing forebrain. Development 121: 839–853.
22. Yu, R. T., McKeown, M., Evans, R. M., and Umesono, K. (1994). Relationship between Drosophila gap gene tailless and a vertebrate nuclear receptor Tlx. Nature 370: 375–379.
23. Norris, R. A., et al. (2000). Human PRRX1 and PRRX2 genes: cloning, expression, genomic localization, and exclusion as disease genes for Nager syndrome. Mamm. Genome 11: 1000–1005.
24. Leussink, B., et al. (1995). Expression patterns of the paired-related homeobox genes MHox/Prx1 and S8/Prx2 suggest roles in development of the heart and the forebrain. Mech. Dev. 52: 51–64.
25. Missler, M., Fernandez-Chacon, R., and Sudhof, T. C. (1998). The making of neurexins. J. Neurochem. 71: 1339–1347.
26. Missler, M., Hammer, R. E., and Sudhof, T. C. (1998). Neurexophilin binding to α-neurexins. A single LNS domain functions as an independently folding ligand-binding unit. J. Biol. Chem. 273: 34716–34723.
27. Raper, J. A. (2000). Semaphorins and their receptors in vertebrates and invertebrates. Curr. Opin. Neurobiol. 10: 88–94.
28. Nardelli, J., Thiesson, D., Fujiwara, Y., Tsai, F. Y., and Orkin, S. H. (1999). Expression and genetic interaction of transcription factors GATA-2 and GATA-3 during development of the mouse central nervous system. Dev. Biol. 210: 305–321.
29. Saraste, M., Sibbald, P. R., and Wittinghofer, A. (1990). The P-loop—a common motif in ATP- and GTP-binding proteins. Trends Biochem. Sci. 15: 430–434.
30. Smith, V., Chou, K. N., Lashkari, D., Botstein, D., and Brown, P. O. (1996). Functional analysis of the genes of yeast chromosome V by genetic footprinting. Science 274: 2069–2074.
31. Winzeler, E. A., et al. (1999). Functional characterization of the S. cerevisiae genome by gene deletion and parallel analysis. Science 285: 901–906.
32. Sapi, E., and Kacinski, B. M. (1999). The role of CSF-1 in normal and neoplastic breast physiology. Exp. Biol. Med. 220: 1–8.
33. Fata, J. E., et al. (2000). The osteoclast differentiation factor osteoprotegerin-ligand is essential for mammary gland development. Cell 103: 41–50.
34. Lewis, M. T., et al. (1999). Defects in mouse mammary gland development caused by conditional haploinsufficiency of Patched-1. Development 126: 5181–5193.
35. Neville, M. C., Morton, J., and Umemura, S. (2001). Lactogenesis. The transition from pregnancy to lactation. Pediatric Clinics North Am. 48: 35–52.
36. Falk, R. J. (1992). Isolated prolactin deficiency: a case report. Fertil. Steril. 58: 1060–1062.
37. Pollard, J. W., and Hennighausen, L. (1994). Colony stimulating factor 1 is required for mammary gland development during pregnancy. Proc. Natl. Acad. Sci. USA 91: 9312–9316.
38. Shizuya, H., et al. (1992). Cloning and stable maintenance of 300-kilobase-pair fragments of human DNA in Escherichia coli using an F-factor based vector. Proc. Natl. Acad. Sci. USA 89: 8794–8797.
39. Altschul, S. F., et al. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25: 3389–3402.

Sequence data from this article have been deposited with the DDBJ/EMBL/GenBank Data Libraries under accession numbers AF461063 (F. rubripes cosmid 48H12), AF520420 (mouse BAC 551D16), BK000461 (human HSPC019 cDNA), AF520419 (human c222389 cDNA), AF520418 (human LACE1 cDNA), AF520417 (mouse Lace1 cDNA).

GENOMICS Vol. 80, Number 1, July 2002 Copyright © 2002 Elsevier Science (USA). All rights reserved.
