Structural evolution of the BRCA1 genomic region in primates


Genomics 84 (2004) 1071 – 1082 www.elsevier.com/locate/ygeno

Hong Jin 1, Joanna Selfe 1,2, Caroline Whitehouse, Joanna R. Morris, Ellen Solomon *, Roland G. Roberts

Division of Medical & Molecular Genetics, GKT Medical School, King’s College, London SE1 9RT, UK

Received 30 April 2004; accepted 25 August 2004. Available online 12 October 2004.

Abstract

Segmental duplications account for up to 6% of the human genome, and the resulting low-copy repeats (LCRs) are known to be associated with more than 20 genomic disorders. Many such duplication events coincided with the burgeoning of the Alu repeat family during the last 50 million years of primate evolution, and it has been suggested that the two phenomena might be causally related. In tracing the evolution of the BRCA1 17q21 region through the primate clade, we discovered the occurrence over the last 40 million years of a complex set of about eight large gene-conversion-mediated rearrangements in the ~4 Mb surrounding the BRCA1 gene. These have resulted in the presence of large and probably recombinogenic LCRs across the region, the creation of the NBR2 gene, the duplication of the BRCA1/NBR1 promoter, the bisection of the highly conserved ARF2 gene, and multiple copies of the KIAA0563 gene. The junctions lie within AluS repeats, members of an Alu subfamily which experienced massive expansion during the time that the rearrangements occurred. We present a detailed history of this region over a critical 40-million-year period of genomic upheaval, including circumstantial evidence for a causal link between Alu family expansion and the rearrangement-mediated destruction and creation of transcription units.

© 2004 Elsevier Inc. All rights reserved.

Keywords: BRCA1; NBR1; ADP ribosylation factors; Alu elements; Low-copy repeats; Segmental duplication; Gene conversion; Genomic disorders; Primate evolution

Introduction

It has been estimated that approximately one-twentieth of the human genome has undergone large-scale duplication in the last 40 million years [1]. This has resulted in the generation of low-copy repeats (LCRs) which have the potential to mediate high-frequency nonallelic homologous recombination (NAHR) events, many of which have been implicated in human genetic disease [2,3]. The possibility also exists, however, that some of the segmental duplications have played an active part in our evolution [4–7],

* Corresponding author. Division of Medical & Molecular Genetics, GKT Medical School, 8th Floor, Guy’s Tower, Guy’s Hospital, London SE1 9RT, UK. Fax: +44 20 7188 2585. E-mail address: [email protected] (E. Solomon).
1 These authors contributed equally to this work.
2 Present affiliation: Institute of Cancer Research, Sutton, Surrey SM2 5NG, UK.
0888-7543/$ - see front matter © 2004 Elsevier Inc. All rights reserved. doi:10.1016/j.ygeno.2004.08.019

either by the duplication or disruption of existing genes, or by the creation of novel genes via the juxtaposition of previously distant material. The human BRCA1 gene encodes an 1863-amino acid ubiquitously expressed protein implicated in aspects of DNA repair and checkpoint control [8,9]. A well-conserved N-terminal RING domain and C-terminal BRCT motifs flank a large, poorly conserved central region of unknown function. BRCA1 exhibits characteristics of a tumor suppressor, in that heterozygosity for loss-of-function alleles at BRCA1 confers a high lifetime risk of breast and ovarian cancers (MIM 113705). The nearby NBR1 gene [10] encodes a 966-amino acid protein that is also ubiquitously expressed. It contains an N-terminal OPR domain, a ZZ domain, and a C-terminal UBA domain. We have shown that it interacts with two proteins [11], calcium and integrin-binding protein (CIB) and a novel protein kinase C-ζ-interacting protein (FEZ1), but the function of NBR1 remains unclear.


H. Jin et al. / Genomics 84 (2004) 1071–1082

The orthologous murine Brca1 and Nbr1 genes are transcribed divergently from a single promoter region [12], with the 5′ ends of the majority of ESTs and annotated full-length transcripts encoded by genomic sites a mere 200–940 bp apart [13]. In humans, however, a duplication of a ~15-kb region containing this promoter system has occurred, with the two copies of the promoter lying ~45 kb apart. The intervening sequence includes ~15 kb of material related to other loci elsewhere on chromosome 17. The outwardly directed transcription units of this pair of divergent promoters retain their presumed ancestral (murine) function, namely the transcription of BRCA1 and NBR1, respectively (Fig. 1A). The inwardly facing transcription units, however, have acquired novel functions; dissociation from their respective coding sequences has liberated them to transcribe into the intervening 45-kb region. Although the copy of the “general” NBR1 1b promoter does not seem to give rise to transcripts, the copy of the upstream NBR1 1a promoter directs transcription of a novel mRNA, NBR2 [14], which incorporates exons from the extraneous inserted material. Although reasonably well expressed at the mRNA level and correspondingly well represented in the EST database, NBR2 has only a very short open-reading frame (112 codons), part of which comprises a LINE1 element sequence, commencing with a

poor Kozak motif. The termination of the open-reading frame 313 bp upstream of the last splice event would also imply that the mRNA may be subject to nonsense-mediated decay [15]. In addition, preliminary experiments (J. Selfe, unpublished data) suggest that expression at the protein level is negligible. The status of NBR2 as a bona fide “gene” therefore remains unclear. The copy of the BRCA1 promoter (known as cBRCA1) has suffered an insertion of 342 bp (mostly comprising a processed pseudogene of acidic ribosomal phosphoprotein P1, ARPP1) into its first exon. All human ESTs originating from this promoter use a donor splice site within the ARPP1 sequence and thereafter incorporate a selection of cryptic exons. This rearrangement of the promoter regions of two biologically important genes in primates has implications for our specific functional knowledge of these genes, for our understanding of bidirectional promoters, and for the process of regulatory elaboration during primate evolution. In this work we describe the stepwise emergence of the complex structure of the human BRCA1/NBR2/cBRCA1/NBR1 locus via a process of duplication and transposition of extraneous material. Our work reveals a series of large-scale gene-conversion-mediated rearrangements which have occurred during primate evolution. These result in the generation of several large and potentially recombinogenic

Fig. 1. Duplication of the BRCA1/NBR1 promoter in OWMs. (A) Relationship between mouse and human BRCA1/NBR1 promoters. The promoter regions are shown schematically but to scale, with the single mouse copy in the middle and the two human copies above and below. Approximate transcription start sites as judged by examination of expressed sequences in public databases are marked with arrows. Black boxes indicate first exons. Hatched box indicates insertion of ARPP1 into one of the human copies. See Supplementary Fig. 1 for sequence-level detail. (B) Phylogenetic tree of mammalian BRCA1/NBR1 promoter sequences. Sequences acquired from primate BRCA1/NBR1 promoters (ranging in length from 1317 to 1706 bp) were aligned together with orthologous human, mouse, and rat sequences using CLUSTALW (see Supplementary Fig. 1) and used to generate a neighbor-joining phylogram. The tree was then midpoint-rooted. The node representing promoter duplication is labeled. See legend to Supplementary Fig. 1 for accession numbers.


LCRs, the disruption of the highly conserved ARF2 gene, the duplication of the BRCA1/NBR1 promoter region, the creation of the NBR2 transcription unit, and the generation of seven copies of the KIAA0563 gene, all within the last 40 million years. Almost all of these rearrangements appear to be causally linked to the expansion of the AluS repeat subfamily. We discuss the consequences of these events for human genetics and evolution.

Results

BRCA1 and NBR1 are not tightly linked in bony fish

The availability of the almost complete genome sequence for Fugu rubripes (a teleost) affords an opportunity to examine the relationship between the BRCA1 and NBR1 genes in a nonmammalian vertebrate. The Fugu genome contains two closely related genes (10 and 7 kb long) which encode proteins resembling NBR1. In each case there are multiple upstream genes, none of which is related to BRCA1. Interestingly, two genes (ARF4L and DUSP3) which lie less than 500 kb 3′ of the human and mouse NBR1 genes are found a short way downstream of one of the Fugu NBR1 genes, revealing a degree of conserved linkage between these distantly related vertebrates. The Fugu BRCA1 orthologue is encoded by a 15-kb gene lying on an unlinked sequence scaffold. Examination of the genome sequence of the zebrafish Danio rerio revealed a similar situation, namely two paralogous NBR1-related genes, neither closely related to the BRCA1 locus. We therefore infer that tight linkage of BRCA1 and NBR1 with a shared bidirectional promoter is not essential to their function, and that it is either a derived trait in mammals or a primitive trait lost in teleosts. The two paralogous NBR1 genes reflect a widespread teleost-specific genome duplication event which has been described elsewhere [16].

Duplication and divergence of the BRCA1/NBR1 promoter region

To establish the point in primate evolution at which the promoter duplication occurred, we used a comparison of human and mouse sequences to inform the design of oligonucleotide primers likely to direct amplification of most primate BRCA1/NBR1 promoters in two overlapping ~800-bp sections. These were used to amplify genomic DNA (or in the case of baboon, DNA from a BAC containing the BRCA1 and NBR1 genes) from a range of primate species. We successfully obtained products from chimpanzee, gibbon, baboon, tamarin, and owl monkey. No products were obtained from tarsier despite the use of alternative primer sets and varied PCR conditions. All products were sequenced, either directly or after subcloning, depending


on whether a mixture of products was obtained. All OWM species (chimpanzee, gibbon, baboon) yielded two sequences, one clearly related to each of the two human copies of the promoter (Accession Numbers: AY581855, AY581856, AY581857, AY581858, AY581859, AY581860). Each of the NWMs (tamarin and owl monkey) yielded only one sequence, equally related to each of the human copies (Accession Numbers: AY581861, AY581862). We aligned these sequences together with the human, mouse, and rat sequences from GenBank (Supplementary Fig. 1). A neighbor-joining phylogenetic tree (Fig. 1B) constructed from this alignment was consistent with the accepted phylogeny of order Primata, and indicated that the duplication of the BRCA1/NBR1 promoter region occurred in a common ancestor of extant OWMs shortly after their divergence from NWMs at the breakup of Gondwanaland. We note that the ARPP1 insertion is present in all of the OWM cBRCA1 promoters, suggesting that this was introduced shortly after the promoter duplication. Thereafter the two copies of the bidirectional promoter seem to have evolved separately, with little evidence of gene conversion between the two. The presence of both copies on a single baboon BAC clone suggests that as in humans, the two promoters remain tightly physically linked throughout primates. The cBRCA1/NBR1 copy (91% identical between human and baboon) is more highly conserved than the BRCA1/NBR2 copy (87%).

The human genome contains at least eight loci related to NBR2

A BLAST search of the human genome (ENSEMBL build 34) using the 266-bp NBR2 exon 3 sequence reveals eight distinct closely related loci (96–97% identical to NBR2), all on chromosome 17. Six of these are within a 4-Mb region of 17q21. The remaining two are also on chromosome 17 (Fig. 2A). We designate them NBR2 loci 1, 2, 3, 3′, 4, 5, 6, and 7, with locus 1 being that adjacent to BRCA1.

To assess the physical extent of the similarity between these eight loci, we performed dot-plot comparisons (using DOTTER [17]) between the adjacent genomic regions, testing 20-kb segments against each other (between loci and within loci) exhaustively until no further unique regions of similarity were detected. The results are summarized in Fig. 2B. The islands of similarity around the NBR2 loci range from 99.7% identity over 160 kb between the tandemly repeated loci 3 and 3′, through 94% identity over 50 kb between loci 5 and 4, down to 91% identity over only ~3 kb between loci 1 and 2 (Fig. 3A). Loci 3, 3′, 4, and 6 are on the opposite strand from loci 1, 2, 5, and 7. An example of a DOTTER plot between 20-kb segments is shown in Fig. 3B. Calculation of the predicted sizes of NBR2-containing EcoRI fragments from the genomic sequence shows reasonable agreement with the observed restriction pattern obtained from total human DNA, taking into account


Fig. 2. Sequence relationships between the human and the mouse NBR2-related loci. (A) Large-scale human/mouse synteny relationship in the 17q21 region. The upper horizontal line denotes human chromosome 17 from 37.5 Mb (17q12) to 63.4 Mb (17q24.1), centered around a 4-Mb region in 17q21.31. The lower horizontal line denotes the orthologous region from mouse chromosome 11. Lines connecting these two lines indicate large-scale human-mouse orthology. Green pentagons indicate the position and orientation of NBR2 loci. Red arrowheads indicate BRCA1 and NBR1 promoters. The five exons of the ARF2 genes are also shown. (B) Pattern of sequence relatedness between the eight human loci (1–7) and two mouse loci (M; the BRCA1/NBR1 locus at the top and the NBR2 ancestral locus at the bottom), as revealed by low-stringency BLAST search and dot-plot analysis. The diagrams are shown to scale (see 20-kb scale bar) and color-coded as follows: red, shared between mouse and human NBR2 loci; green, shared between loci 3, 3′, 4, 5, and 7; cream, shared between loci 3, 5, and 7; blue, regions of BRCA1/NBR1 locus shared between duplicates; black, exons or exon-related sequences. Pink triangles indicate AluS repeats at rearrangement boundaries. The positions of the NBR2 exon 3 sequences (indicated by vertical dotted lines) in ENSEMBL build 34 are as follows: locus 1, 41663839–41664104; locus 2, 42471767–42472032; locus 3, 45109686–45109951; locus 3′, 44892272–44892537; locus 4, 45611886–45612151; locus 5, 44056049–44056314; locus 6, 37576777–37577013; locus 7, 63395941–63396206.
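The coordinates listed in the legend can be checked against the figures quoted in the text (a 266-bp exon 3 query and a roughly 4-Mb cluster). The following is a minimal sketch of that arithmetic. It assumes the coordinates are inclusive, and it assumes that the six clustered loci are those other than loci 6 and 7, which is an inference from the positions rather than a statement in the paper.

```python
# Coordinates of the NBR2 exon 3 hits, copied from the Fig. 2 legend
# (ENSEMBL build 34, chromosome 17). Assumption: start and end are inclusive.
EXON3_HITS = {
    "1": (41663839, 41664104),
    "2": (42471767, 42472032),
    "3": (45109686, 45109951),
    "3'": (44892272, 44892537),
    "4": (45611886, 45612151),
    "5": (44056049, 44056314),
    "6": (37576777, 37577013),
    "7": (63395941, 63396206),
}


def hit_length(start, end):
    """Length in bp of an inclusive coordinate interval."""
    return end - start + 1


def cluster_span(loci):
    """Distance in bp from the first to the last base covered by the given loci."""
    starts = [EXON3_HITS[name][0] for name in loci]
    ends = [EXON3_HITS[name][1] for name in loci]
    return max(ends) - min(starts) + 1


lengths = {name: hit_length(*coords) for name, coords in EXON3_HITS.items()}

# Assumption: the six loci "within a 4-Mb region" are all except 6 and 7.
clustered = ["1", "2", "3", "3'", "4", "5"]
span_bp = cluster_span(clustered)

# Print the loci in chromosomal order with the length of each hit.
for name in sorted(EXON3_HITS, key=lambda n: EXON3_HITS[n][0]):
    print(f"locus {name}: {lengths[name]} bp")
print(f"span of clustered loci: {span_bp / 1e6:.2f} Mb")
```

On these coordinates every hit is 266 bp long except locus 6, which is 237 bp, in keeping with the remark in the Fig. 3 legend that the locus 6 hybridization target is shorter.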

unresolved multiple bands indicated by increased dosage (Fig. 3C, right-hand panel).
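The prediction of EcoRI fragment sizes from genomic sequence can be sketched as follows. This is an illustrative reconstruction, not the authors' procedure: the function name and the toy sequence are hypothetical, and the only facts relied on are that EcoRI recognizes GAATTC and cuts after the first G.

```python
import re

ECORI_SITE = "GAATTC"  # EcoRI recognition sequence; the enzyme cuts G^AATTC


def ecori_fragment_containing(seq, probe_start, probe_end):
    """Return (start, end) of the EcoRI fragment that fully contains the
    half-open probe interval [probe_start, probe_end), or None if the probe
    itself spans a cut site. Coordinates are 0-based on the top strand.
    Fragments touching either end of seq are truncated by the sequence ends.
    """
    # Cut position is one base into each recognition site.
    cuts = [m.start() + 1 for m in re.finditer(ECORI_SITE, seq.upper())]
    bounds = [0] + cuts + [len(seq)]
    for left, right in zip(bounds, bounds[1:]):
        if left <= probe_start and probe_end <= right:
            return left, right
    return None


# Hypothetical toy sequence with two EcoRI sites flanking a probe target.
toy = "AAA" + "GAATTC" + "C" * 9 + "GAATTC" + "TTT"
fragment = ecori_fragment_containing(toy, 10, 14)
if fragment is not None:
    print("fragment size (bp):", fragment[1] - fragment[0])
```

Applied to each of the eight loci, with the exon 3 coordinates as the probe interval, the returned fragment sizes are the values that would be compared with the bands on the Southern blot.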

The number of NBR2-related loci has increased during primate evolution

To assess the representation of NBR2-related sequences in primates, genomic DNAs from a range of primate and nonprimate eutherian mammals were digested with EcoRI, electrophoresed on an agarose gel, blotted, and probed with radiolabeled NBR2 exon 3 probe. The resulting autoradiogram (Fig. 3C) reveals a single band in the prosimian lemur (also in galago, not shown), two bands in NWMs (marmoset and owl monkey; shown in Fig. 3C for EcoRI, but also seen with HindIII), and three to five bands in OWMs and related Catarrhini (African green monkey, baboon, gibbon, chimpanzee, human), with increased dosage suggesting unresolved multiplets in human and chimpanzee.

The mouse NBR2-related locus is syntenic with human locus 5

A BLAST search of the mouse genome (ENSEMBL build 30) using the block of conserved sequence containing NBR2 exon 3 revealed a single related mouse locus, lying approximately 1.9 Mb from the murine BRCA1 gene. Although giving a significant diagonal on dotplots between human and mouse loci, the similarities between the largest aligned block range between 60 and 70%. Comparative studies of orthologous human and mouse introns (assumed not to be under functional selective pressure) show that about a quarter of their length comprises alignable blocks, these averaging about 75% identity [18]. This suggests that NBR2 sequences have not been under substantial selective pressure for most of the history of the human and mouse lineages. The position of the mouse NBR2-related sequence puts it in an identical gene context to human locus 5, suggesting that the


Fig. 3. Distribution of NBR2-related sequences in primates and other mammals. (A) Similarity matrix showing average percentage identity between human NBR2 loci over entire extent of homologous blocks, together with the approximate length of these blocks in kilobases. (B) Example of dot plot performed using DOTTER between 20-kb sections of loci 3 and 5. This plot shows the extreme left-hand end of locus 3, where the duplicated material terminates in an AluSx repeat. The subfamily of Alu repeat (as determined by RepeatMasker) is indicated on both axes. (C) Left-hand panel: Southern blot of EcoRI-digested genomic DNAs from indicated species, probed with human NBR2 exon 3. A broad phylogeny of the animals is given above. Right-hand panel: short exposure of a higher resolution blot of EcoRI-digested human DNA, showing approximate correspondence between empirical band sizes and EcoRI fragment sizes derived from the public human genome sequence of the eight identified NBR2 loci (at right, in base pairs). The locus 6 band is only seen on longer exposures as the hybridization target is shorter and more divergent.
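The idea behind a dot plot, and the way it reveals whether two loci lie on the same or opposite strands, can be illustrated with a simple word-match sketch. This is not DOTTER's algorithm, which scores sliding windows. It is a simplified exact k-mer comparison, and the function names and toy sequence are hypothetical.

```python
def revcomp(seq):
    """Reverse complement of an uppercase DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def word_matches(a, b, k=4):
    """Return (i, j) pairs where the k-mer at a[i:] equals the k-mer at b[j:].
    Plotted as points, runs of matches along a diagonal mark similar regions."""
    index = {}
    for j in range(len(b) - k + 1):
        index.setdefault(b[j:j + k], []).append(j)
    return [(i, j)
            for i in range(len(a) - k + 1)
            for j in index.get(a[i:i + k], [])]


def relative_strand(a, b, k=4):
    """Guess whether b resembles a on the same or the opposite strand by
    counting word matches against b and against its reverse complement."""
    forward = len(word_matches(a, b, k))
    reverse = len(word_matches(a, revcomp(b), k))
    return "same" if forward >= reverse else "opposite"


# Hypothetical toy sequence compared with itself and with its reverse complement.
toy = "ACGTTGCAAGGCTA"
print(relative_strand(toy, toy))
print(relative_strand(toy, revcomp(toy)))
```

For real 20-kb segments a longer word size would be needed to suppress chance matches, and repeats such as Alu elements would produce many off-diagonal points.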

ancestor of all primate NBR2-like sequences lies in this location (Figs. 2A and 2B).

Deviations from human-mouse synteny and disruption of the ARF2 gene

Comparison of the orthologous regions of the human and mouse genomes shows that although there is large-scale synteny, the region between human loci 3 and 4 is inverted and translated with respect to the corresponding sequence in mouse (Fig. 2A). Loci 3, 4, and 5 lie at the boundaries of syntenic blocks. It is very likely that this complex rearrangement arose via two serial inversion/duplication events: (a) generation of locus 4 and inversion of the region between loci 5 and 4; (b) generation of locus 3 and inversion of the region between loci 5 and 3. Interestingly, the gene for ADP-ribosylation factor 2 (ARF2), while intact in the mouse, is bisected by the inversion, such that exons 1–3 lie near locus 3 (and are transcribed and spliced onto this copy of NBR2 exon 3; see below) and exons 4 and 5 lie some distance from locus 5. The human genome does not appear to contain an intact copy of ARF2.

Human transcripts related to NBR2

A BLAST search of human EST sequences held by the NCBI using NBR2 exon 3 revealed 16 entries corresponding to locus 1, and 23 corresponding to locus 3. No ESTs encoded by other loci were observed. Eleven of the 16 locus 1 entries commence with the canonical NBR2 transcript structure (Accession Number U88573), namely that duplicated copies of NBR1 exons 1 and 3 are followed by NBR2 exon 3. Of the remainder, three retain an intron and two include novel exonic material. Almost all of those which continue 3′ of exon 3 terminate in the intron rather than splicing onto the reported exon 4. In all but two of the 23 locus 3 entries, NBR2 exon 3 is immediately preceded by exons 1–3 of a degenerate version of the ARF2 gene (see above). After NBR2 exon 3, almost all terminate in the intron rather than splicing onto an additional exon. Thus the ancestral (locus 5) copy of the NBR2 exon 3 sequence does not itself appear to contribute substantially to mRNA transcripts; instead its juxtaposition 3′ of strong promoters (the copy of the NBR1 promoter at locus 1 and the disembodied ARF2 promoter at locus 3) results in its transcription and inclusion in a processed


transcript. In the case of locus 1, only a small coding region would be translated, initiated from a methionine codon in exon 3. In the case of the locus 3 transcript, however, the “NBR2” ORF is spliced in register with the preceding ARF2 ORF, resulting in a potential novel 177-residue protein.

New world monkeys have two NBR2-related loci

A cosmid library of genomic DNA from the NWM C. jacchus was probed for BRCA1-, NBR1-, and NBR2-related sequences. No NBR1-positive clones were obtained. The single BRCA1-positive cosmid did not cross-hybridize with the NBR2 probe. The six positive NBR2 clones fell into two clear groups of three on the basis of shared restriction digest patterns (data not shown). NBR2 cross-reacting restriction fragments from each group were of an identical size to one of the two bands detected in a Southern blot of total genomic DNA from the closely related species S. oedipus (Fig. 3C), thereby showing that both loci were fully accounted for. Amplification and sequence analysis of their NBR2-related regions confirmed that they corresponded to two distinct loci (Accession Numbers: AY581866, AY581867). Hybridization of a Southern blot of C. jacchus NBR2 cosmids with probes derived from the MPP2 and PPY genes (which lie near human locus 2) showed that one of the loci was orthologous to human locus 2. PCR using primers flanking exon one of the KIAA0356 gene (which lies near human locus 5) showed that the second NWM locus was orthologous to the mouse NBR2-related sequence and to human locus 5. NWMs hence have a single orthologue of the last common ancestor of most human loci and an orthologue of human locus 2; there appear to be no NBR2-related sequences at the BRCA1 locus. Sequence analysis of the ends of the cosmids confirms this conclusion.

Old world monkeys have four NBR2-related loci

To obtain a detailed snapshot of the evolution of the NBR2 loci in the primate clade, we used BRCA1, NBR1, and NBR2 hybridization probes to isolate genomic BAC clones from the olive baboon, Papio anubis, an OWM. This yielded 45 clones: five containing both BRCA1- and NBR1-related sequences, and 40 containing NBR2-related sequences. Despite the duplication of the bidirectional promoter in baboon, none of the BRCA1/NBR1-containing BACs possessed NBR2-related material as detected by hybridization or PCR. The 40 NBR2 BAC clones were characterized by analysis of their NBR2-related sequence, by testing for the presence or absence of sequences known to lie near the eight known human loci, and by direct sequence analysis of the clone ends using vector primers. This allowed resolution of the following distinct classes of sequence: 8, 6, 9, and 6 clones representing loci 2, 4, 5, and 6, respectively, and 11 clones containing sequences mapping to human chromosomes other than chromosome 17 (in most cases, a region on chromosome 9). NBR2 exon 3 sequences could be amplified and sequenced (Accession Numbers: AY581863, AY581864, AY581865) from all clones except those containing locus 6 (which lacks one of the primer binding sites) and those mapping to other chromosomes (which are considered false positives). The relatively even representation of clones for loci 2, 4, 5, and 6 strongly suggests that these are the only four NBR2-related loci in baboons, a finding consistent with the genomic Southern blot data (Fig. 3C).

Gene conversion between NBR2-related loci

In an attempt to establish the set of orthologous and paralogous relationships among the human, baboon, and marmoset NBR2 exon 3 sequences from sequence data alone, we aligned them (data not shown) and generated a neighbor-joining phylogenetic tree (Fig. 4A). Unexpectedly, this showed unambiguous clustering within species, such that all human loci (except for the distant locus 6) resembled each other more than any resembled a given baboon or marmoset locus, and so on. This implies either (a) that the multiple NBR2 loci have arisen independently in each animal lineage from a single ancestral locus, or (b) that extensive interlocus gene conversion has occurred following a smaller number of ancestral duplication events. The latter explanation is much more parsimonious, and is strongly supported by our observation of conserved linkage of adjacent genes. Thus orthology could be established via conserved linkage even when gene conversion destroys information at the sequence level.

Discussion

A history of the BRCA1-NBR1-NBR2 locus

We have been able to combine a detailed analysis of public database information with newly acquired data from a range of primate species to piece together a likely evolutionary history (based on parsimony) of the complex human BRCA1-NBR1-NBR2 locus and the surrounding chromosomal region. A tentative description of this history is as follows (see Figs. 4B and 5):

(a) Head-to-head fusion of BRCA1 and NBR1 in the tetrapod clade

We found that BRCA1 and NBR1 orthologues in the nontetrapod vertebrates D. rerio and F. rubripes are not closely linked to each other, and certainly do not share single bidirectional promoters. Although the genome sequence is not currently available for a close common ancestor of teleosts and mammals which might enable us to


Fig. 4. Expansion of the primate NBR2-related sequence family. (A) Neighbor-joining phylogenetic tree constructed from aligned human, baboon, and marmoset NBR2 sequences (Accession Numbers: baboon locus 2, AY581863; baboon locus 4, AY581864; baboon locus 5, AY581865; marmoset locus 2, AY581866; marmoset locus 5, AY581867. See legend to Fig. 2 for sources of human locus sequence.). Locus numbers have been determined solely from the nearby presence of syntenic loci and BAC/cosmid end sequences. (B) Inferred historical tree of primate loci based on presence of syntenic loci, Southern blot data, human genomic map location, and extent and degree of long-range similarity. Bold line indicates human locus 5, which is syntenic with the single ancestral locus. Horizontal dotted lines indicate the presumed situation in last common ancestors of humans and the indicated species. Letters correspond to those used to denote duplication events in the Discussion. Lozenges indicate main periods for expansion of the indicated Alu repeat subfamilies [24].

determine whether the linked or unlinked state is ancestral, we have assumed that the unlinked state is ancestral, and that head-to-head fusion of the two genes occurred in a tetrapod ancestor of mammals.

(b) Duplicative transposition of NBR2-like sequence in an ancestor of NWM and OWMs

We show that the single mammalian ancestor of the NBR2 exon 3 sequence (orthologous to human locus 5) underwent a duplication at some point between the simian/prosimian split and the NWM/OWM split. This gives two NBR2-related loci in NWMs (locus 2 and locus 5). Subsequent gene conversion has occurred between these two loci.

(c) Tandem duplication of BRCA1/NBR1 promoter region in an ancestor of OWMs

We have directly shown that paralogous duplicated bidirectional promoter sequences exist in extant OWMs but not in NWMs. We infer that a tandem duplication of the region occurred approximately 30 million years ago, after which the two copies diverged substantially. Both copies retain some transcriptional activity in both directions. Appreciable gene conversion has not occurred between copies of the bidirectional promoter. The cBRCA1 exon 1 incurred an insertion of an ARPP1 sequence soon after duplication.

(d) Duplicative transposition to generate NBR2 locus 6 in an ancestor of OWMs

Locus 6 is distant from the other loci, perhaps accounting for the lack of ongoing gene conversion.

(e) Inversion/duplication near locus 5 to give locus 4 in an ancestor of OWMs

A 1.5-Mb region telomeric to locus 5 is inverted and >50 kb (including locus 5) is duplicated to generate locus 4. These rearrangements result in four loci in baboon (and probably African green monkey and gibbon), with substantial gene conversion occurring between at least three of these loci.

(f) Duplicative transposition of NBR2-like locus to BRCA1 in great apes

A duplication of one of the NBR2-related loci (possibly locus 5) generates locus 1 between the two copies of the BRCA1/NBR1 promoter. In humans this sequence is then used as an exon in the NBR2 transcript.

(g) Inversion/duplication near locus 5 to give locus 3 in great apes

An 800-kb region telomeric to locus 5 is inverted and >75 kb (including locus 5) is duplicated to generate locus 3. The degree of divergence between these two loci (2.6%) is greater than the average divergence between human and chimpanzee intergenic DNA (1.2% [19]), suggesting that the rearrangement is present in both species. The inversion splits the ARF2 gene, resulting in a novel ARF2-NBR2 fusion transcript from locus 3 in great apes.

(h) Duplicative transposition to generate NBR2 locus 7 in great apes

This locus is the most distant, and may not have undergone gene conversion since its creation.


(i) Tandem duplication of locus 3 in humans

A 160-kb region around locus 3, including the novel ARF2-NBR2 fusion gene, is tandemly duplicated to give locus 3′. The low sequence divergence between loci 3 and 3′ (0.3%) suggests that it is human specific, and may even be polymorphic within our species.

We note that in all but one case the duplication of an NBR2 locus to generate a novel locus in the same orientation results in a simple duplicative transposition, whereas generation of a novel locus in the opposite orientation results in inversion of the intervening material (inversion/duplication). It is likely that each rearrangement results from an aberrant gene conversion event, with the outcome determined by the orientation of the target strand.

The status of NBR2 as a “gene”

The tandem duplication of the BRCA1/NBR2 promoter region in OWMs results in the juxtaposition of two orphan promoters (“NBR2” and “cBRCA1”) against novel material. In humans the cBRCA1 promoter generates a diverse set of noncoding transcripts using a variety of cryptic exons (data not shown). The NBR2 promoter, however, generates a fairly homogeneous set of transcripts, most of which comprise the copies of NBR1 exons 1 and 3 (NBR2 exons 1 and 2), followed by NBR2 exon 3. This latter exon is only present in apes, and seems to be the 3′-most exon in the majority of ESTs, with the canonical 5-exon transcript being relatively underrepresented. The longest open-reading frame in the 3-exon transcript commences with a poor Kozak sequence, and encodes a potential protein of only 63 residues, with no significant similarity to a protein of known function. The rarer 5-exon transcripts encode a further 49 residues at the C-terminus, derived from a LINE-1 element. From these data it seems unlikely that NBR2 encodes a protein of functional significance.
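The Introduction's remark that the NBR2 open-reading frame ends 313 bp upstream of the last splice event bears on this conclusion, because a stop codon lying more than roughly 50 to 55 nucleotides upstream of the final exon-exon junction is the commonly cited trigger for nonsense-mediated decay. The sketch below applies that rule of thumb. Only the 313-bp distance comes from the text. The exon lengths, the stop position, and the function names are hypothetical values chosen to reproduce that distance.

```python
NMD_THRESHOLD_NT = 50  # rule-of-thumb distance; an assumption, not from the paper


def stop_to_last_junction(exon_lengths, stop_end):
    """Distance in nt from the end of the stop codon to the last exon-exon
    junction, in mRNA coordinates. Negative if the stop lies in the last exon."""
    last_junction = sum(exon_lengths[:-1])
    return last_junction - stop_end


def predicted_nmd_target(exon_lengths, stop_end, threshold=NMD_THRESHOLD_NT):
    """True if the stop codon lies more than `threshold` nt upstream of the
    final exon-exon junction. Single-exon transcripts are never flagged."""
    if len(exon_lengths) < 2:
        return False
    return stop_to_last_junction(exon_lengths, stop_end) > threshold


# Hypothetical 5-exon transcript: the exon lengths are invented so that the
# stop codon ends 313 nt upstream of the last junction, as stated in the text.
hypothetical_exons = [100, 150, 266, 400, 300]
hypothetical_stop_end = sum(hypothetical_exons[:-1]) - 313

print(stop_to_last_junction(hypothetical_exons, hypothetical_stop_end))
print(predicted_nmd_target(hypothetical_exons, hypothetical_stop_end))
```

The rule is a heuristic with known exceptions, so a positive result indicates only that the transcript is a candidate for decay.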
The status of the novel ARF2-NBR2 fusion “gene”

The ADP-ribosylation factors (ARFs) and ARF-like (ARL) proteins constitute a family of small (175- to 185-amino acid) regulatory GTPases within the Ras superfamily. Most mammals appear to have six ARFs which fall into three groups, each having a single invertebrate orthologue (Type I, ARF1/ARF2/ARF3; Type II, ARF4/ARF5; Type III, ARF6). The ARFs combine a regulated GTPase activity with a “myristoyl switch” which coordinates membrane association. ARFs are thought to function predominantly in the regulation of vesicle trafficking, particularly in the Golgi. Although there is a wealth of structural data on these proteins, the consequences of loss of function of any given ARF in mammals are unknown. ARF1 and ARF3 are each 100% identical between human and mouse at the amino acid level. ARF2, on the

other hand, despite showing 100% identity between mouse, rat, and cow, is grossly aberrant in humans. We show that this is due to a duplicative inversion in an ancestor of great apes, whereby the human orthologue of the mouse ARF2 gene was torn in two. Thus in humans (and presumably chimps, gorillas and orangutans) exons 1–3 and exons 4–5 of ARF2 lie almost 700 kb apart and on opposite DNA strands. In addition, a further human-specific tandem duplication has generated a second copy of exons 1–3 together with 160 kb of surrounding material. The two copies are 99.7% identical at the nucleotide level. The promoter and splice sites for human ARF2 exons 1–3 appear to be fully functional, and result in a large number of ESTs in which these three exons are spliced onto the locus 3 (or locus 3′) version of NBR2 exon 3. The context of the human ARF2 translational start site is almost identical to that of the mouse, and the reading frame is conserved and in register with the largest open-reading frame of NBR2 exon 3. It therefore seems highly likely that a reasonable amount of the encoded novel 177-residue fusion protein is produced. The first half of this protein (residues 1–96) encodes three α-helices and four β-sheets which make up one hemisphere of the ARF structure [20]. These 96 codons have undergone 21 nonsynonymous mutations, 14 of which result in nonconservative substitution of amino acids which are invariant in all known metazoan ARFs. Conservation of the N-terminus suggests that the product of the ARF2-NBR2 fusion gene will be myristoylated. We have therefore identified a striking difference between the great apes and all other mammals, namely the gross disruption in apes of a gene (ARF2) which encodes an extremely highly conserved protein, and the creation of a fusion gene (ARF2-NBR2) capable of expressing a novel protein. Reports of such substantial differences between humans and other primates are rare [21].
Multiple copies of the KIAA0563 gene

The locus 5 region contains the gene encoding the KIAA0563 transcript [22], which encodes a transmembrane leucine-rich repeat-containing protein of unknown function. Although there are a number of distantly related paralogues in humans and other mammals, the NBR2 locus duplications in OWMs generate multiple closely related copies (at loci 3, 3′, 4, and 7, together with partial copies at 2 and 6). Almost all of these are represented in the GenBank EST database, suggesting ongoing expression.

Ongoing potential role of NBR2 loci in genomic rearrangement

The extensive LCRs, together with the high historical gene conversion rates, raise the question as to whether this region of 17q21 is subject to ongoing deletion or inversion events, most likely via NAHR during meiosis [3]. While


locus 7 is rather too distant, it can be seen from Figs. 2A and 3A that loci 3–5 lie within 2 Mb of each other, and share 94–99% identity over regions ranging from 50 to 160 kb. This set of parameters is well within the range of those of LCRs involved in a variety of human pathogenic mutations [2]. For example, the LCRs responsible for duplications and deletions of the peripheral myelin protein 22 (PMP22) gene in Charcot-Marie-Tooth syndrome type 1A (CMT1A [MIM 118220]) and hereditary neuropathy with liability to pressure palsies (HNPP [MIM 162500]) consist of 24-kb regions 1.5 Mb apart and sharing 98.7% identity. Likewise those responsible for deletions of the neurofibromin gene in neurofibromatosis type 1 (NF1 [MIM 162200]) are 60-kb regions 1.5 Mb apart and 97% identical. Given the known high NAHR rates between pathogenic LCRs, it seems extremely likely that there are numerous individuals carrying the following NBR2 locus-mediated rearrangements: (a) deletion or duplication of the region between loci 3/3′ and 4, (b) inversion of the region between loci 5 and 3/3′ or between 5 and 4. The deletion or duplication will only have a phenotypic effect if one of the four interstitial genes is dosage sensitive (as happens in CMT1A/HNPP). The genes lying between loci 3 and 4 encode two proteins involved in membrane fusion (N-ethylmaleimide-sensitive factor NSF and Golgi SNAP


receptor complex member 2 GOSR2) and two members of the Wnt family of signaling proteins (Wnt-3 and Wnt-9B). A loss-of-function mutation in Wnt-3 causes tetra-amelia with a strictly recessive mode of inheritance (OMIM 273395), heterozygotes showing no evidence of haploinsufficiency [23]; consequences of mutations in the other three genes are not known. The inversions, on the other hand, would not disrupt any genes, and are expected not to have direct phenotypic consequences. They may even be polymorphic within the human population; however, polymorphic inversions have been found to predispose to other rearrangements such as recurrent translocations [3].

Alu repeats and primate genome evolution

The NBR2 locus, having spent much of its history in a stable state, appears to have been involved in a large number of gene-conversion-related events in the last 40 million years of primate evolution (Fig. 5). This has included five duplicative transpositions, two inversion/duplications, and enough nonduplicative gene conversion between homologous loci to obliterate any trace of sequence orthology between loci in at least three species (humans, baboons, and marmosets). The reason for this apparent change in behavior is not clear, although it has

Fig. 5. A proposed history of the human BRCA1/NBR1/NBR2 locus. An inferred history of the region is shown with evolutionary time running from top to bottom. Red arrows indicate BRCA1- and NBR1-related promoters. Green pentagons indicate position and orientation of NBR2-related loci. Blue arrow indicates a presumed continuous ancestral genomic region (containing the ARF2 gene - yellow rectangle) which undergoes a series of rearrangements. Black arrows indicate duplication events. Animal names in parentheses indicate species used to infer common ancestral state. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)


been suggested [3,24] that the high frequency of LCRs in the human genome (up to 6%) may be causally related to the primate-specific explosion of Alu retroposon activity 35–40 Mya. Estimated times of Alu subfamily expansions [24] are 65–40 Mya (AluJ), 45–25 Mya (AluS), and 30–0 Mya (AluY), the last two coinciding temporally with most of the observed primate segmental duplication events. The same authors found that the frequency of Alu repeats at the junctions of human LCRs was more than twice the genome average, with almost all of this enrichment being accounted for by the more recently expanded AluS and AluY subfamilies. To assess whether this global tendency was reflected in our detailed case, we compared the junctions of LCRs within the BRCA1 region (as determined by dot plot) with the location of repeats (as determined by RepeatMasker). We found that almost all of the junctions described in this paper lie within members of the AluS family, including the ends of the BRCA1/NBR1 promoter duplication (see pink triangles in Fig. 2B). The exceptions are both ends of locus 2, which were not in Alu repeats (and whose generation predates the OWM/NWM divergence), and the left-hand end of locus 7, which was not established. The right-hand ends of loci 3′, 4, and 6, and of 7 and 3, although close, are in distinct AluS copies. Fig. 3B shows an example of such a junction, that at the left-hand end of locus 3, where the end of the shared material lies in an AluSx repeat. It is also interesting to note that the single member of the AluY subfamily (which expanded later than the AluJ and AluS subfamilies) is not shared by the loci. The association between duplication breakpoints and AluS repeats is highly significant.
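The comparison of LCR junction positions with repeat annotations described above amounts to a point-in-interval lookup. The following is a minimal sketch of that idea only; the coordinates, junction names, and the `Repeat` and `repeat_at` helpers are hypothetical and are not taken from the paper or from RepeatMasker output.

```python
from bisect import bisect_right
from typing import List, NamedTuple, Optional


class Repeat(NamedTuple):
    start: int   # 1-based inclusive start coordinate
    end: int     # 1-based inclusive end coordinate
    family: str  # repeat family label, e.g. "AluSx"


def repeat_at(position: int, repeats: List[Repeat]) -> Optional[Repeat]:
    """Return the repeat containing `position`, or None.

    Assumes `repeats` is sorted by start and that intervals do not overlap.
    """
    starts = [r.start for r in repeats]
    i = bisect_right(starts, position) - 1  # last repeat starting at or before position
    if i >= 0 and repeats[i].start <= position <= repeats[i].end:
        return repeats[i]
    return None


# Hypothetical annotation and junction coordinates, for illustration only.
repeats = [
    Repeat(100, 400, "AluSx"),
    Repeat(900, 1200, "L1"),
    Repeat(2000, 2300, "AluY"),
]
junctions = {"junction_a": 250, "junction_b": 700, "junction_c": 2100}

hits = {name: repeat_at(pos, repeats) for name, pos in junctions.items()}
in_alus = [name for name, r in hits.items()
           if r is not None and r.family.startswith("AluS")]
```

In practice the repeat list would be parsed from an annotation file and the junction coordinates read off the dot plots; the lookup itself stays the same.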
Alu repeats make up 21% of the genome sequence around locus 5, and 35% of the sequence around locus 1, an unusually high figure (AluS elements account for 12 and 19%, respectively; these data are derived from RepeatMasker analysis of 300 and 126 kb centered around the two loci). A binomial test of the likelihood of 14 or more of the 17 breakpoints hitting Alu repeats by chance yields P values <10⁻⁴ for the high locus 1 Alu density and <10⁻⁵ for the lower locus 5 density. The correlation, both spatially and temporally, between AluS transposition and large-scale duplication, provides strong circumstantial evidence for a causative link between the expansion of this repeat subfamily and most of the gross rearrangements in this genomic region.

In summary, we have documented a burst of duplicational events which coincides temporally with the expansion of a repeat subfamily (AluS) whose members lie at the end of each duplicated region. This represents strong circumstantial evidence that the fresh transposition of Alu repeats in OWMs directly triggered the reshaping of this genomic region (and presumably others). Among the consequences of this Alu-triggered remodeling over the last 30 million years are the duplication of the

BRCA1/NBR1 promoter, the disruption of the highly conserved ARF2 gene, the creation of at least two novel transcription units, two megabase-scale inversions, and the generation of multiple divergent copies of the KIAA0563 gene. This implicates the AluS family in the generation of the raw material on which natural selection might operate. As a price for this, the potential for pathogenic NAHR has also been increased.
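The binomial test quoted in the Alu discussion above (14 or more of 17 breakpoints falling in Alu repeats) can be checked with a short upper-tail calculation. This is a sketch under one stated assumption: that the chance probability of a single breakpoint landing in an Alu repeat equals the local Alu fraction of the sequence (0.35 around locus 1, 0.21 around locus 5). The text does not spell out how the test was parameterized, and the function name `binomial_tail` is illustrative.

```python
from math import comb


def binomial_tail(k: int, n: int, p: float) -> float:
    """Upper-tail binomial probability P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Assumption: per-breakpoint hit probability = local Alu density from the text.
p_locus1 = binomial_tail(14, 17, 0.35)  # higher Alu density around locus 1
p_locus5 = binomial_tail(14, 17, 0.21)  # lower Alu density around locus 5

print(f"P(>=14 of 17 | p=0.35) = {p_locus1:.2e}")
print(f"P(>=14 of 17 | p=0.21) = {p_locus5:.2e}")
```

Under this assumption both tail probabilities fall below the thresholds reported in the text (10⁻⁴ and 10⁻⁵ respectively).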

Materials and methods

Mammalian DNAs

Genomic DNA was extracted from cultured cells of a range of primate species: Old world monkeys (OWMs), human (Homo sapiens cell line MOLT4), chimpanzee (Pan troglodytes cell line CRL-1609), gibbon (Hylobates lar cell line MLA 144), and African green monkey (Cercopithecus aethiops cell line COS-7); New world monkeys (NWMs) representing the two principal families, the northern owl monkey (Aotus trivirgatus cell line CRL-1556) from family Cebidae, and the cotton-topped tamarin (Saguinus oedipus cell line B95-8) from family Callithricidae. Prosimians were represented by genomic DNA from the crowned lemur (Lemur coronatus; a gift from Christian Roos at the Deutsches Primatenzentrum) and the Philippine tarsier (Tarsius syrichta; a gift from David Haring at the Duke University Primate Center). In addition, we used genomic DNA libraries from the OWM olive baboon (Papio anubis; BAC library RPCI-41 cloned in pBACe3.6 from BACPAC Resources, Oakland, CA) and the callithricid NWM common marmoset (Callithrix jacchus; cosmid library RZPD-160 cloned in Lawrist7 from Deutsches Ressourcenzentrum für Genomforschung GmbH, Berlin, Germany).

PCR

The following primers were used to amplify a ~700-bp fragment containing NBR2 exon 3-related sequences from baboon and marmoset genomic clones: NBR2b2f (GACCGCTCAGCTTTCATTCCAGTG) and NBR2b2r (TTACGTTACACTACCCCAGATAGG) (665–768 bp products from human loci). Further primer pairs were used to amplify two overlapping DNA segments which together form a ~1400- to 1800-bp region spanning the BRCA1/NBR2, ψBRCA1/NBR1, or BRCA1/NBR1 promoters from total genomic DNA of a range of primates: BRpromAF (TTAYGCCTCTCAGGTTCCGCCCC) and BRpromAR (CCAATCTATCCACTGGATTTCCGTG) (807 or 1157 bp in humans); BRpromBF (TCCGCCCTAATGGAGGTCTCCAG) and BRpromBR (GCAGGATTCCTCCCTTGAACTTC) (791 or 801 bp in humans). In the case of OWMs, where primers BRpromBF and BRpromBR generate two nearly comigrating products,


these were cloned into pCR2.1-TOPO vector (Invitrogen) before sequencing.

Southern blotting

Genomic DNA was extracted from in vitro cultured cells using a sodium perchlorate-based method [25], and approximately 10 µg was digested with an excess of restriction enzyme. The digested DNA was electrophoresed in a 0.8% agarose gel, transferred onto nylon membrane (Hybond N+, Amersham, UK), and hybridized with [α-32P]dCTP-labeled DNA probe before washing to moderate stringency (1× SSC, 65°C) and exposing to Kodak XAR film.
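The primer BRpromAF listed in the PCR section contains the IUPAC degenerate base Y (C or T), so it represents two concrete oligonucleotides. The sketch below expands degenerate primers and computes basic properties such as length and GC fraction. The primer sequences are copied from the text; the helper names (`expand_degenerate`, `gc_fraction`, `reverse_complement`) are illustrative and not part of the published methods.

```python
from itertools import product

# IUPAC nucleotide codes mapped to the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def expand_degenerate(primer: str) -> list:
    """List every unambiguous sequence encoded by a degenerate primer."""
    return ["".join(bases) for bases in product(*(IUPAC[b] for b in primer.upper()))]


def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in an unambiguous sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def reverse_complement(seq: str) -> str:
    """Reverse complement of an unambiguous DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


# Primer sequences as given in the text.
primers = {
    "NBR2b2f": "GACCGCTCAGCTTTCATTCCAGTG",
    "NBR2b2r": "TTACGTTACACTACCCCAGATAGG",
    "BRpromAF": "TTAYGCCTCTCAGGTTCCGCCCC",
    "BRpromAR": "CCAATCTATCCACTGGATTTCCGTG",
}

for name, seq in primers.items():
    for variant in expand_degenerate(seq):
        print(name, variant, len(variant), round(gc_fraction(variant), 2))
```

The reverse complement helper is included because reverse primers anneal to the opposite strand, which matters when locating them on a reference sequence.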


Sequence analysis

PCR products were sequenced directly or after cloning into plasmid vectors using PCR primers and vector primers, respectively (sequences available on request). Sequencing was performed on an ABI 3100 automated fluorescent sequencer using an ABI BigDye Terminator v2.0 kit according to manufacturer’s instructions. Cosmids and BACs were sequenced in a similar manner except for the use of 100 thermal cycles for the sequencing reaction. T7 and SP6 promoter primers were used for both pBACe3.6 and Lawrist7 clones.

Bioinformatics

Newly acquired sequence was curated and manipulated in Vector NTI Suite Version 6.0 (InforMax Inc., Bethesda, MD). Sequence alignments and phylogenetic trees were generated using ClustalW. Genomic sequence data for Homo sapiens, Mus musculus, Rattus norvegicus, Fugu rubripes, and Danio rerio were acquired through the Ensembl Genome Server, http://www.ensembl.org, and the GenBank database at NCBI, http://www.ncbi.nih.gov/Genbank/. Although the NCBI Build 34 of the human genome sequence has an apparent ~100-kb gap within the NBR1 gene, closer examination shows that the two flanking sequences, AC060780 and AC109326, actually overlap by a 2-kb region which includes two internal exons of the NBR1 gene. Combined with the fact that there is no break in human-mouse synteny across NBR1, and that the intact NBR1 gene is separately represented in the unassigned contig NT_078102 (AC087365), we feel able to ignore the Build 34 gap. Dot-plots were performed using DOTTER [17]. Repetitive elements were detected using RepeatMasker, http://repeatmasker.genome.washington.edu/.

Note added in proof

Since submitting this manuscript, two papers have been published which examine large-scale copy number variation across the human genome (Sebat et al., Science 2004, 305: 525–528, and Iafrate et al., Nature Genetics 2004, 36: 949–951). In each of these, a distinct copy number polymorphism involving the NBR2 loci is identified (CNP79 in Sebat et al., and RP11-79O18 in Iafrate et al.), thereby confirming a central prediction of our paper.

Acknowledgments

This work was supported by a Medical Research Council Ph.D. studentship (J.S.) and Medical Research Council programme Grant G9600577 (J.R.M., C.W., E.S.).

Appendix A. Supplementary data

Supplementary data associated with this article can be found, in the online version, at doi:10.1016/j.ygeno.2004.08.019.

References

[1] J.A. Bailey, et al., Recent segmental duplications in the human genome, Science 297 (2002) 1003–1007.
[2] P. Stankiewicz, J.R. Lupski, Genome architecture, rearrangements and genomic disorders, Trends Genet. 18 (2002) 74–82.
[3] C.J. Shaw, J.R. Lupski, Implications of human genome architecture for rearrangement-based disorders: the genomic basis of disease, Hum. Mol. Genet. 13 (2004) R57–R64.
[4] J.A. Bailey, et al., Human-specific duplication and mosaic transcripts: the recent paralogous structure of chromosome 22, Am. J. Hum. Genet. 70 (2002) 83–100.
[5] J.K. Kulski, et al., The evolution of MHC diversity by segmental duplication and transposition of retroelements, J. Mol. Evol. 45 (1997) 599–609.
[6] D. Torrents, M. Suyama, E. Zdobnov, P. Bork, A genome-wide survey of human pseudogenes, Genome Res. 13 (2003) 2559–2567.
[7] A. Wong, et al., Diverse fates of paralogs following segmental duplication of telomeric genes, Genomics 84 (2004) 239–247.
[8] Y. Miki, et al., A strong candidate for the breast and ovarian cancer susceptibility gene BRCA1, Science 266 (1994) 66–71.
[9] R. Scully, D.M. Livingston, In search of the tumour-suppressor functions of BRCA1 and BRCA2, Nature 408 (2000) 429–432.
[10] I.G. Campbell, et al., A novel gene encoding a B-box protein within the BRCA1 region at 17q21.1, Hum. Mol. Genet. 3 (1994) 589–594.
[11] C. Whitehouse, et al., NBR1 interacts with fasciculation and elongation protein zeta-1 (FEZ1) and calcium and integrin binding protein (CIB) and shows developmentally restricted expression in the neural tube, Eur. J. Biochem. 269 (2002) 538–545.
[12] J.A. Chambers, E. Solomon, Isolation of the murine Nbr1 gene adjacent to the murine Brca1 gene, Genomics 38 (1996) 305–313.
[13] S. Dimitrov, M. Brennerova, J. Forejt, Expression profiles and intergenic structure of head-to-head oriented Brca1 and Nbr1 genes, Gene 262 (2001) 89–98.
[14] C.F. Xu, et al., Isolation and characterisation of the NBR2 gene which lies head to head with the human BRCA1 gene, Hum. Mol. Genet. 6 (1997) 1057–1062.
[15] E. Nagy, L.E. Maquat, A rule for termination-codon position within intron-containing genes: when nonsense affects RNA abundance, Trends Biochem. Sci. 23 (1998) 198–199.
[16] J.S. Taylor, I. Braasch, T. Frickey, A. Meyer, Y. Van de Peer, Genome duplication, a trait shared by 22000 species of ray-finned fish, Genome Res. 13 (2003) 382–390.
[17] E.L. Sonnhammer, R. Durbin, A dot-matrix program with dynamic threshold control suited for genomic DNA and protein sequence analysis, Gene 167 (1995) GC1–GC10.
[18] N. Jareborg, E. Birney, R. Durbin, Comparative analysis of noncoding regions of 77 orthologous mouse and human gene pairs, Genome Res. 9 (1999) 815–824.
[19] F.C. Chen, W.H. Li, Genomic divergences between humans and other hominoids and the effective population size of the common ancestor of humans and chimpanzees, Am. J. Hum. Genet. 68 (2001) 444–456.
[20] J.C. Amor, D.H. Harrison, R.A. Kahn, D. Ringe, Structure of the human ADP-ribosylation factor 1 complexed with GDP, Nature 372 (1994) 704–708.
[21] P. Gagneux, A. Varki, Genetic differences between humans and great apes, Mol. Phylogenet. Evol. 18 (2001) 2–13.
[22] T. Nagase, et al., Prediction of the coding sequences of unidentified human genes. IX. The complete sequences of 100 new cDNA clones from brain which can code for large proteins in vitro, DNA Res. 5 (1998) 31–39.
[23] S. Niemann, et al., Homozygous WNT3 mutation causes tetra-amelia in a large consanguineous family, Am. J. Hum. Genet. 74 (2004) 558–563.
[24] J.A. Bailey, G. Liu, E.E. Eichler, An Alu transposition model for the origin and expansion of human segmental duplications, Am. J. Hum. Genet. 73 (2003) 823–834.
[25] M.B. Johns Jr., J.E. Paulus-Thomas, Purification of human genomic DNA from whole blood using sodium perchlorate in place of phenol, Anal. Biochem. 180 (1989) 276–278.