Genomic sequence comparison of the human and mouse XRCC1 DNA repair gene regions



GENOMICS 25, 547-554 (1995)

JANE E. LAMERDIN,1 MISHELLE A. MONTGOMERY, STEPHANIE A. STILWAGEN, LISA K. SCHEIDECKER, ROBERT S. TEBBS, KERRY W. BROOKMAN, LARRY H. THOMPSON, AND ANTHONY V. CARRANO

Human Genome Center, Biology and Biotechnology Research Program, L-452, Lawrence Livermore National Laboratory, 7000 East Avenue, Livermore, California 94550

Received August 9, 1994; revised November 4, 1994

The XRCC1 (X-ray repair cross complementing) gene is involved in the efficient repair of DNA single-strand breaks formed by exposure to ionizing radiation and alkylating agents. The human gene maps to chromosome 19q13.2, and the mouse homologue maps to the syntenic region on chromosome 7. Two cosmids (approximately 38 kb each) containing the human and mouse genes were sequenced to an average 8-fold clonal redundancy. The XRCC1 gene spans a genomic distance of 26 kb in mouse and 31.9 kb in human. Both genes contain 17 exons, are 84% identical within the coding regions, and are 86% identical at the amino acid sequence level. Intron and exon lengths are highly conserved. For the human cosmid, a total of 43 Alu repetitive elements are present, a density of 1.1 Alu/kb, but due to clustering, the local density is as high as 1.8 Alu/kb. In addition, we observed a statistically significant bias for insertion of these elements in the 3'-5' orientation relative to the direction of XRCC1 transcription, predominantly in the second and third introns. This bias may indicate that XRCC1 is more accessible to Alu retroposition events during transcription than genes not expressed during spermatogenesis. The density of B1 and B2 elements in the mouse is 0.4/kb, integrated primarily in the 5'-3' orientation. The human chromosome 19-specific minisatellite PE670 was present in the same orientation in 3 introns in the human gene, and a similar repeat was found at 3 different locations in the mouse cosmid. Five simple sequence repeats were found in the human cosmid, and 16 different repeats were observed in the mouse cosmid. The coding region prediction algorithm XGRAIL 1.1 identified 15 of 17 exons in the human gene and 14 of 17 in the mouse. In addition to the coding regions, 9 conserved elements were identified between mouse and human, with sequence identities ranging from 65 to 78%. Several of these elements correspond to introns that are conserved across their entire length and may be important for proper splicing of the transcript to maintain regions of the XRCC1 protein required for proper folding. © 1995 Academic Press, Inc.

The U.S. Government's right to retain a nonexclusive royalty-free license in and to the copyright covering this paper, for governmental purposes, is acknowledged.
1 To whom correspondence should be addressed. Telephone: (510) 423-3629. Fax: (510) 422-2282.

0888-7543/95 $6.00 Copyright © 1995 by Academic Press, Inc. All rights of reproduction in any form reserved.

INTRODUCTION

The XRCC1 (X-ray cross-complementing rodent repair group 1) gene product is involved in the repair of DNA single-strand breaks in response to either ionizing radiation or alkylating agents (Thompson et al., 1982). The human and mouse genes encode proteins of 633 and 631 amino acids, respectively, which are 86% identical and 93% similar (Brookman et al., 1994). Recent biochemical studies suggest that the XRCC1 protein is required for the activity of DNA ligase III (Caldecott et al., 1994) and indeed is physically associated with DNA ligase III within cells, but its function remains unknown. Other genes involved in DNA repair and replication have been physically mapped within a 2-Mb region of chromosome 19q13.2-q13.3 containing XRCC1. They include ERCC1 (excision repair cross-complementing rodent repair group 1) (Thompson et al., 1985), ERCC2 (Mohrenweiser et al., 1989; Thompson et al., 1990), and DNA ligase I (Barnes et al., 1992). The syntenic region on mouse chromosome 7 contains the mouse homologs of these genes.

The characterization of DNA repair pathways derives from somatic cell genetic studies to characterize new complementation groups in repair-defective rodent cell lines and to identify individual genes by cloning, as well as biochemical studies of individual proteins and protein-protein complexes in vitro. Defects in repair proteins can manifest as specific disease phenotypes, such as xeroderma pigmentosum and Cockayne syndrome, wherein each disease may represent different interactions of the same few proteins (Cleaver, 1994). The human phenotypes produced by defects in many of the DNA repair genes, e.g., XRCC1, are not known. Hence, animal models to study the phenotype of a defect in each of these genes would be very useful. A knockout of the mouse ERCC1 gene has been described (McWhir et al., 1993). It is the first successful transgenic animal model of a DNA repair gene, and one which produced a much more severe phenotype in homozygous mice than was expected.

While comparative sequence from human and mouse exists for several DNA repair genes at the cDNA level (Van Duin et al., 1988; Brookman et al., 1994), little genomic sequence is currently available. Genomic sequence not only would more clearly define intron/exon boundaries of coding regions, but also would identify conserved regulatory elements. Such comparative sequence would be invaluable for several applications, including construction of gene targeting vectors for homologous recombination in mouse embryonic stem (ES) cells and studies of gene regulation. Moreover, genomic sequence comparisons would provide insight into the evolution of complex genomes.

We describe the genomic structure of 38 kb in human and mouse containing the XRCC1 DNA repair gene. Several highly conserved elements other than coding regions are identified, as well as the locations of various repetitive elements.

MATERIALS AND METHODS

Cosmid subcloning and sequencing. The procedures used for identification and isolation of XRCC1-containing cosmids f5050 (human) and MXR1-13 (mouse cosmid library derived from strain DBA2) are described elsewhere (Trask et al., 1993; Brookman et al., 1994). Ten micrograms of each cosmid was sonicated at 4°C in a volume of 350 µl using two 3-s pulses separated by 2 min at the lowest setting on a Model 185 Sonifier cell disruptor (Heat Systems-Ultrasonics, Inc., Plainview, NY). End repair was performed twice with a mixture of T4 DNA polymerase and Klenow fragment as previously described (Martin-Gallardo et al., 1993). The repaired DNA was size-fractionated on 0.8% agarose, and fragments from 2 to 6 kb were excised and purified using GeneClean (Bio101, La Jolla, CA). The blunt-ended DNA was subcloned into the SmaI site of pBluescript SK(+) (Stratagene, La Jolla, CA) and transformed into DH5α competent cells (GIBCO-BRL/Life Technologies, Gaithersburg, MD). DNA templates for sequencing were prepared by the Qiagen mini-alkaline lysis method according to the manufacturer's instructions (Qiagen, Chatsworth, CA). Double-stranded templates were sequenced on a Catalyst 800 Molecular Biology Labstation (Applied Biosystems (AB), a division of Perkin-Elmer, Foster City, CA) using KS and T3 fluorescently labeled primers and AB Taq cycle sequencing kits. Resultant sequencing ladders were loaded on a 6% polyacrylamide gel, and data were collected on an AB 373A DNA sequencer.

Sequence assembly and gap closure. Individual sequences were manually edited to remove vector and ambiguities at the 3' end using AB SeqEd software. Cosmid assembly was performed in GEL (Intelligenetics (IG), Mountain View, CA). After approximately 85% of each cosmid had been assembled, the projects were transferred to the IG GENeration program, where assembly was completed. The average read length for double-stranded templates entered into each project was ~330 bp. Gaps in the cosmid sequence were filled by primer walking or subcloning of PCR products and restriction fragments spanning the gaps. Exonuclease III digestion (McCombie et al., 1991) was used to generate new sequencing templates from the cloned fragments. PCR-generated templates were amplified from two sources, the cosmid and a gap-spanning plasmid; both products were subcloned, and multiple clones from each product were sequenced so that random errors introduced by PCR could be readily detected. Only two such polymerase errors were detected, and these were easily resolved by comparison to the sequence from an overlapping shotgun clone. Ambiguities in the assembled sequence were resolved by visual inspection of the chromatograms, and in some instances resequencing of ambiguous regions with Sequenase DyeDeoxy Terminators (AB). Areas containing compressions or other mobility artifacts were resequenced using Taq DyeDeoxy Terminators with dITP. Gap closure in both species was complicated by the frequent occurrence of difficult repeat regions, e.g., PE670 (Das et al., 1987) and Alu repeats in human and long simple sequence repeats in mouse. Clone instability was often associated with these repeat regions.

A total of 955 fragments were sequenced to assemble the human cosmid to an average redundancy of 8.4. The sequence was determined on both strands for 87.4% of the cosmid; those areas not covered on both strands are flanked by, or encompass, repetitive regions, specifically Alu or PE670. However, redundancy on the sequenced strand in all of these regions is greater than 4-fold. For the mouse cosmid, 959 fragments were sequenced to an overall redundancy equal to that of the human cosmid. Sequence was determined on both strands for 96% of the mouse cosmid. The sequence of each cosmid was submitted to the Genome Sequence Data Base under Accession Nos. L34079 for human and L34078 for mouse.

Sequence analysis. Coding regions of the XRCC1 gene were identified by comparison of the human cDNA to the genomic sequences using the ALIGN program (IG suite) or by XGRAIL, Version 1.1 (Uberbacher and Mural, 1991). Repetitive elements in both species were initially identified by comparison to a subset of known human (Alu, L1, THE, LTR, and PE670) and mouse (B1, B2, L1, and PE670) repeats using ALIGN. Additional searches were performed against GSDB (daily update), the Repetitive Element Data Base (Jurka et al., 1992), SWISSPROT (Release 26), and Prosite (Release 10.2) using FASTA or BLASTN (Altschul et al., 1990). More stringent homology comparisons of the two cosmids were performed using Alignment Tools (Hardison and Miller, 1993).
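The redundancy figures quoted above for the shotgun assembly follow from the fragment count, the average read length, and the cosmid length. The following is a minimal sketch of that arithmetic, not the calculation used by the authors; it assumes one read of average length (~330 bp) per sequenced fragment and takes the assembled cosmid lengths (37,785 bp human, 37,349 bp mouse) from the Results. The function name is illustrative.

```python
def clonal_redundancy(n_reads: int, mean_read_bp: float, target_bp: int) -> float:
    """Average number of reads covering each base of the target."""
    return n_reads * mean_read_bp / target_bp


# Fragment counts and the ~330-bp average read length come from the text;
# cosmid lengths are the assembled sizes reported in the Results.
# Assumption: one read of average length per sequenced fragment.
human = clonal_redundancy(955, 330, 37_785)
mouse = clonal_redundancy(959, 330, 37_349)
print(f"human: {human:.2f}-fold, mouse: {mouse:.2f}-fold")
```

Both estimates land close to the reported 8.4-fold redundancy; small differences are expected because ~330 bp is itself an approximate average.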
Conserved elements were screened against the Transcription Factor Database (Release 7) using BLASTX and the Eukaryotic Promoter Database (Release 35) using TBLASTN.

RESULTS AND DISCUSSION

Sequence Validation

Both human and mouse cosmid sequences were validated by comparison of restriction digests of the cosmid to that predicted by the assembled sequence. For the human cosmid, single (BamHI, BssHII, DraI, EagI, EcoRI, HindIII, KpnI, MluI, NaeI, NotI, PvuI, SalI, SfiI, SmaI, SphI, XbaI, and XhoI) and double (EagI/SalI, EagI/SfiI, EagI/BssHII, BssHII/SfiI, SmaI/SalI, BamHI/SalI, and HindIII/SalI) digests were performed. For the mouse cosmid, comparison was made to a complete restriction map generated by partial digests (Brookman et al., 1994). All but two restriction sites predicted by the assembled sequences agreed with the fragment sizes produced by digestion. The two exceptions occurred for XbaI restriction sites at positions 7259 and 21,637 in the human cosmid. XbaI is sensitive to adenine methylation at GATC by the Dam methylase present in most bacterial host strains (Nelson and McClelland, 1992). Both XbaI sites predicted in the human cosmid that failed to cut with the enzyme contained the sequence TCTAGAtc, indicating an overlapping Dam methylation site.

PCR primers used for gap closure and others designed specifically for sequence validation were tested against the sequenced cosmids and either human or mouse genomic DNA to verify that the proper size fragment was present in the appropriate genome. For the mouse XRCC1 cosmid, 11 pairs of primers with an average distance of 2.4 kb between pairs were designed and tested. The primers produced an average fragment size of 850 bp and spanned 89% of the mouse cosmid. All primer pairs tested amplified the same size fragment in the cosmid and mouse genomic DNA. Nine primer pairs from the human cosmid, separated by an average distance of 2.9 kb and generating an average fragment of 790 bp, were tested against human genomic DNA. All 9 pairs produced the expected size fragment in both the cosmid and the human chorionic genomic DNA. Fewer primer pairs were designed for the human cosmid due to the high incidence of clustered Alu repeats, as detailed below, which made amplification across these elements difficult.

Analysis of the Human Cosmid Sequence

The analysis of the 37,785 bp encompassing the human XRCC1 gene is illustrated in Fig. 1. The human gene is composed of 17 exons and spans a genomic distance of 31.9 kb. The orientation of transcription of the XRCC1 gene was determined to be from telomere to centromere by comparing predicted EcoRI restriction fragment order to the EcoRI map information from overlapping cosmids spanning the region containing cosmid f5050 and other anchored genetic markers (data not shown).

FIG. 1. Structural elements within the 37.8-kb sequence containing the human XRCC1 DNA repair gene and flanking regions. The position of each element in the cosmid is indicated as a vertical line relative to the linear scale at the bottom of the figure. Elements placed above the solid black line are found on the plus strand (relative to the direction of transcription of XRCC1), and elements below are found on the minus strand. The orientation of transcription of XRCC1 is from telomere to centromere. Included are the positions of the 17 exons, as identified by homology to the human cDNA, and those exons or ORFs predicted by XGRAIL. The copy number of various tandemly repeated elements, e.g., PE670 and simple sequence repeats, is indicated above or below the type of repeat. Also identified are the insertion positions of numerous Alu repeats. Exon and repeat element lengths are not drawn to scale.

A total of 43 Alu repetitive elements, including 2 right monomers, are present, a density of 1.1 Alu/kb. This density is slightly lower than the 1.4 Alu/kb density seen at the ERCC1 locus (Martin-Gallardo et al., 1992), located in 19q13.3, but higher than the average density (0.25 per kb) predicted for the rest of the genome (Moyzis et al., 1989). There is a heavy bias (2:1) for insertion of these elements in the 3'-5' orientation (displayed schematically as present on the minus strand), predominantly in introns 2 and 3, and many are tandemly arrayed in a head-to-tail fashion (several regions contain 2.5-3 Alu repeats in tandem). The difference observed was statistically significant (P < 0.001) using a χ² test for goodness of fit. This bias may result from a preference for integration of new Alu retroposons into the poly(A) tail or middle A-rich region of a previously integrated element, as the sequences flanking new Alu insertions typically have an average A-T composition of 67% (Batzer et al., 1990). As a result of this clustering effect, the highest local density of Alu elements in the intragenic region of XRCC1 is 32 elements in 18 kb or 1.8 Alu/kb. Similar clustering of Alu elements has been reported in other genomic regions, particularly in the HLA class III region on chromosome 6 (Iris et al., 1993), with densities ranging from 1.7 to 2 Alu/kb. It has been proposed that retroposition of these elements is affected by the local structure of the chromatin during spermatogenesis (Wallace et al., 1991; Vivaud et al., 1993). If true, genes that are actively transcribed in spermatogenesis, such as XRCC1 (Yoo et al., 1992), may be more accessible to retroposition events.

A partial Alu repeat was located at one end of the human cosmid insert. This might be expected since MboI, the enzyme used to generate the human chromosome 19 library (de Jong et al., 1989), has a recognition site in the Alu repetitive element. However, its presence, when taken in consideration of the average density of Alu elements in this cosmid, has implications for direct sequencing from the ends of human clones generated by partial MboI or Sau3A digests. Our experience with the generation of chromosome 19-specific sequence-tagged sites (STSs) has indicated that as many as 30% of the cosmids selected had Alu repeats at one or both ends of the insert. Sequencing of chromosome 19-specific inter-Alu PCR products also indicates a high number of tandemly arrayed Alu repeats (unpublished data). Since chromosome 19 is GC-rich (Gray et al., 1979; Langlois et al., 1982) and Alu sequences appear to predominate in GC-rich DNA due to their CpG content (Hellman-Blumberg et al., 1993), we expect a higher density of Alu repeats on chromosome 19 relative to that expected for the genome as a whole. Therefore, reliance on walking strategies, or sequencing directly off the ends of cosmids, might not be feasible as primary sequencing strategies for this chromosome.

The 37-bp minisatellite, PE670, which is specific to the long arm of human chromosome 19 (Trask et al., 1993), is found in the same orientation in introns 6 (3 repeat elements), 8 (11 repeat elements), and 12 (10 repeat elements). The intron 6 element is somewhat diverged; the two external repeats exhibit 72% identity to the consensus minisatellite sequence, while the middle repeat is 86% identical. The intron 8 and intron 12 repeat elements exhibit 79-97% homology to the consensus sequence, with an average of 89% homology. For these two introns, the most diverged elements tend to be flanked by the more highly conserved elements. The organization and sequence divergence of these elements suggest that they have propagated through cis migration on the long arm of chromosome 19. The length of the intron 8 and intron 12 repeat units (~400 bp) and the difficulty encountered by the Taq polymerase in sequencing through this repetitive region made walking strategies to obtain complete double-strand coverage difficult in these areas. The read lengths obtained were generally not sufficient to cross the repeat into unique sequence, and there was insufficient sequence divergence within the repeat to determine overlap reliably between two adjacent contigs ending in highly conserved stretches of PE670. Instead, such regions were closed by Exonuclease III digestions of large plasmids spanning these regions.

A few simple sequence repeats, primarily dinucleotide repeats, flanking the gene have been identified. Primers have been designed flanking a CA17 repeat to determine its potential for a new polymorphic marker at the XRCC1 locus (Lamerdin et al., in preparation).
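The Alu densities quoted in this section are simple ratios, and the orientation bias was assessed with a χ² goodness-of-fit test. The sketch below reproduces the density arithmetic from the reported counts and shows one way such a test against a 1:1 expectation could be set up. The per-orientation counts are not given in the text, so the counts passed to the test are hypothetical placeholders, and the function names are illustrative; this is not a reconstruction of the authors' calculation.

```python
import math


def density_per_kb(n_elements: int, span_bp: float) -> float:
    """Number of elements per kilobase of sequence."""
    return n_elements / (span_bp / 1000.0)


def chi2_equal_orientation(n_minus: int, n_plus: int) -> tuple:
    """Chi-square goodness-of-fit test (1 df) against a 1:1 expectation."""
    expected = (n_minus + n_plus) / 2.0
    chi2 = ((n_minus - expected) ** 2 + (n_plus - expected) ** 2) / expected
    # For 1 degree of freedom the upper-tail probability is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


overall = density_per_kb(43, 37_785)   # all Alu elements over the whole cosmid
local = density_per_kb(32, 18_000)     # densest intragenic stretch
print(f"overall {overall:.1f} Alu/kb, local {local:.1f} Alu/kb")

# Hypothetical orientation counts, NOT taken from the paper.
chi2, p = chi2_equal_orientation(40, 20)
print(f"chi2 = {chi2:.2f}, P = {p:.4f}")
```

The closed-form tail probability is used only because the test has a single degree of freedom; a statistics library would be the more general choice.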

Analysis of the Mouse Cosmid Sequence

In Fig. 2, we depict the analysis of 37,349 bp containing the mouse XRCC1 gene. The 17 exons of the mouse gene span a genomic distance of 26 kb. The SINEs B1 and B2 are present at a density of 0.4 repeats per kilobase, which is similar to the average density of 0.3 B1 and B2 elements per kilobase predicted by hybridization experiments with total mouse genomic DNA (Bennett et al., 1984).

The mouse has a 53-bp repeat similar to the human minisatellite, PE670, which consists of a 24-bp conserved PE670-like repeat interspersed with a mouse-specific 29-bp repeat. Although the human minisatellite is known to be localized only to the long arm of chromosome 19, clustering in the mouse has not yet been examined. However, this repeat structure has been identified by sequence homology in other genes syntenic to the human 19q13.2 region (e.g., the apoE gene on mouse chromosome 7). In the mouse XRCC1 cosmid, these repeats are found in three locations, although not in the same introns as in the human cosmid. This would imply that this repeat element probably evolved separately by cis migration after the divergence of these genes in humans and mice.

Two complete and two flanking partial copies of an additional ~53-bp tandem repeat element in intron 3 exhibiting homology to a segment of the mouse zona pellucida (ZP3) gene were identified in a FASTA database search. Interestingly, the ZP3 gene also contains the mouse PE670-like repeat; however, this derivative repeat bears little resemblance to the original PE670-like element, aside from its GC bias, and is present in only one location in the mouse XRCC1 cosmid, from bases 25,406 to 25,574.

The mouse cosmid has a much more varied repertoire of simple sequence repeats than the human cosmid, including several clusters of tri- and tetranucleotide repeats. Interestingly, these are dispersed in the introns, in sharp contrast to the human gene, where the few repeats found are in the region flanking the gene.
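Simple sequence repeats of the kind discussed above (di-, tri-, and tetranucleotide units in tandem) can be located with a short pattern search. The sketch below is one possible approach using a regular expression with a backreference; it is not the method used in this study, which relied on ALIGN and database searches. The function name, the minimum copy number of 7, and the test sequence are all assumptions made for illustration.

```python
import re


def find_ssrs(seq: str, min_unit: int = 2, max_unit: int = 4, min_copies: int = 7):
    """Return (start, unit, copies) for tandem repeats of short units.

    The lazy quantifier tries the shortest unit first, so a (CA)n run is
    reported with unit CA rather than CACA.
    """
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_copies - 1)
    )
    hits = []
    for m in pattern.finditer(seq.upper()):
        unit = m.group(1)
        if len(set(unit)) == 1:
            continue  # skip homopolymer runs such as AAAA...
        hits.append((m.start(), unit, len(m.group(0)) // len(unit)))
    return hits


# Synthetic test sequence, not taken from either cosmid.
demo = "GGATCC" + "CA" * 10 + "GTTGAC" + "TGCG" * 12 + "AATT"
for start, unit, copies in find_ssrs(demo):
    print(f"({unit}){copies} at position {start}")
```

Positions are 0-based offsets into the input string; imperfect repeats, which are common in genomic sequence, would need a more tolerant search.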

Coding Region Predictions

We compared the ability of XGRAIL (Version 1.1) to identify the XRCC1 coding regions relative to the locations known from the human and mouse cDNA sequences. In Fig. 1, the XGRAIL predictions for the human gene are shown above the true exons as identified by homology to the cDNA. XGRAIL correctly identified 15 of 17 exons and predicted an alternative exon in intron 2. In the mouse gene, XGRAIL predicted 14 of 17 exons (Fig. 2) and two alternative exons. One of the alternative exons is also in intron 2, but does not correspond to the predicted alternative exon in human intron 2. Interestingly, XGRAIL was reproducibly unable to predict exons 7 and 8 in both species and also missed exon 14 in the mouse. On the minus strand, XGRAIL predicted weak (marginal) coding potential upstream of the human XRCC1 gene and several ORFs with excellent coding potential upstream of the mouse gene. All potential ORFs were translated and queried against SWISSPROT and dbEST, but no significant homologies were detected.

FIG. 2. Structural elements within the 37.3-kb sequence containing the mouse XRCC1 DNA repair gene and flanking regions (refer to Fig. 1 for details).

The quality of donor and acceptor splice junctions for both species was evaluated using the matrix scoring system devised by Stormo (1987). In general, the scores for a specific donor or acceptor site were very similar for the two species (data not shown). All splice consensus sequences followed the AG and GT rule for acceptor and donor sequences, respectively. If we look specifically at exons 7 and 8 in both species, those not predicted by XGRAIL, the human exon 8 acceptor score is -54, while the score is +54 in the mouse, but both are in the range possible for acceptor sites with a consensus AG. Interestingly, both species have negative scores for the exon 6 acceptor site, due to the presence of an additional AG 3 bp downstream of the AG at the 3' end of intron 5. If this second acceptor sequence was used to produce an alternatively spliced message, the resulting protein would be missing one amino acid (a lysine), but would remain in frame. It is not known whether loss of this amino acid would have a deleterious effect on XRCC1 protein function.
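The GT/AG rule and the tandem acceptor described above can be expressed as simple string checks. The sketch below is illustrative only: it does not implement the Stormo (1987) matrix scoring used in this study, the function names are hypothetical, and the sequences are synthetic rather than XRCC1 sequence. It shows why use of an AG lying 3 bp downstream of the normal acceptor removes exactly one codon and leaves the reading frame intact.

```python
def follows_gt_ag_rule(intron: str) -> bool:
    """True if the intron starts with GT (donor) and ends with AG (acceptor)."""
    s = intron.upper()
    return len(s) >= 4 and s.startswith("GT") and s.endswith("AG")


def has_tandem_acceptor(intron: str, exon: str) -> bool:
    """True if a second AG sits 3 bp downstream of the intron's terminal AG."""
    return intron.upper().endswith("AG") and exon.upper()[1:3] == "AG"


def exon_from_downstream_ag(exon: str) -> str:
    """Exon as spliced from the downstream AG: its first 3 nt are lost."""
    return exon[3:]


# Synthetic sequences for illustration only; they are NOT XRCC1 sequence.
intron = "GTAAGTCCTTTCTCTTCCAG"
exon = "AAGGCTGATGTTCCAGAA"

print("GT/AG rule satisfied:", follows_gt_ag_rule(intron))
if has_tandem_acceptor(intron, exon):
    shorter = exon_from_downstream_ag(exon)
    lost = len(exon) - len(shorter)
    print(f"alternative acceptor removes {lost} nt; frame kept: {lost % 3 == 0}")
```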

repeat element, while the h u m a n intron has no repeats. No L1 sequences were found in either cosmid. F u r t h e r comparative analysis of the h u m a n and mouse genomic XRCC1 regions using Hardison's and Miller's (1993)Alignment Tools revealed several conserved elements in addition to the 17 exons. In all, 22 separate homology segments or conserved elements (CEs) were identified. These have been further broken into 26 segments to indicate intron and exon boundaries in cases where two exons were linked by a conserved intron. The lengths of these elements and their similarity are shown in Table 2. Those elements conTABLE 1 Intron and Exon Structure of the Mouse and Human XRCC1 G e n e R e g i o n s

Comparative Structure of the Human and Mouse Coding Regions The structure of the genomic regions containing the h u m a n and mouse XRCC1 genes are compared in Table 1. The mouse and h u m a n genes are 84% identical at the DNA level and 86% identical at the amino acid level (Brookman et al., 1994). Exon and intron lengths are very well conserved, especially for the smaller introns (e.g., introns 5, 7, and 13). The longest intron in both species occurs between exons 2 and 3 and is 11.8 kb in mouse and 13.9 kb in human. The larger size of the h u m a n intron appears to be primarily the result of the large number of Alu insertions (20 Alu elements compared to a total of 7 B1 and B2 elements in the mouse). Alu insertions also occur in and lengthen hum a n introns 3 and 10 relative to the mouse. Intron 15 in the mouse is the only intron significantly larger t h a n its h u m a n counterpart due to the insertion of a B2

Exon length (bp)

Intron length (bp)

Exon

Mouse

Human

Intron

Meuse

Human

1 2 3 4 5 6 7 8 9 10 i1 12 13 14 15 16 17

51 93 117 159 75 112 110 106 259 117 94 130 55 137 91 76 114

51 93 111 159 75 112 110 112 259 117 94 133 55 140 91 76 114

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

430 11,852 5,680 523 96 356 75 168 236 2,288 138 292 87 1,056 753 94

404 13,889 6,105 962 96 308 73 520 328 4,452 156 487 93 1,583 455 105

552

LAMERDIN ET AL.

taining exons also have conserved additional flanking sequence that represents consensus splicing signals. Some of the noncoding elements are as highly conserved as the exons, e.g., CE5, CE8, and CE12 are 78% similar between mouse and human, while CE18 and CE25 (exons 9 and 16) are 78 and 77% similar, respectively. The first three elements, CE1-CE3, are located upstream of the putative transcriptional start site, and as yet have no identifiable function. CE1 corresponds to a marginal ORF predicted by XGRAIL on the minus strand in the human gene, but represents an ORF with excellent coding potential on the minus strand in the mouse cosmid. Elements CE4 and CE5 are located in the 5'-untranslated region of the gene and contain putative CAAT and TATA boxes and a putative transcriptional start site, as identified previously (Thompson et al., 1990). A putative GC box previously identified (Thompson et al., 1990) upstream of the human gene does not appear to be conserved. The 89-bp intron 2 element (CE8) does not correspond to either of the alternative exons predicted by XGRAIL in the two species and is clearly noncoding in nature. Several introns are conserved in their entirety, specifically introns 5 and 7 (CE12 and CE15), which are both less than 100 bp in size and are 78 and 73% identical, respectively. It is interesting that intron 5 is so well conserved due to the presence of an alternative splice site at the 3' end of intron 5. Perhaps this structure is important for proper splicing of the gene.

TABLE 2
Conserved Elements Identified in a Comparison of the Mouse and Human XRCC1 Gene Regions

Element name   Element length (bp)   % similarity   Region
CE1            138                   65             Putative ORF on minus strand
CE2            271                   76             Upstream of 5'-UTR
CE3            83                    75             Upstream of 5'-UTR
CE4            141                   67             5'-UTR
CE5            128                   78             5'-UTR
CE6            90                    84             Exon 1
CE7            121                   89             Exon 2
CE8            89                    78             Intron 2 element
CE9            133                   86             Exon 3
CE10           182                   87             Exon 4
CE11           84                    79             Exon 5
CE12           96                    78             Intron 5
CE13           119                   89             Exon 6
CE14           124                   81             Exon 7
CE15           75                    73             Intron 7
CE16           117                   73             Exon 8
CE17           82                    65             Part of intron 8
CE18           293                   78             Exon 9
CE19           133                   90             Exon 10
CE20           132                   82             Exon 11
CE21           151                   81             Exon 12
CE22           81                    89             Exon 13
CE23           156                   83             Exon 14
CE24           113                   79             Exon 15
CE25           127                   77             Exon 16
CE26           121                   88             Exon 17

The first 82 bp of intron 8 (CE17) are conserved between mouse and human up to the point of the insertion of a 400-bp stretch of PE670 in the human cosmid. Conservation of this region between the two species might be attributable to the presence of a putative nuclear localization signal motif of the XRCC1 protein at the end of exon 8 (Brookman et al., 1994). All noncoding conserved elements were queried against the Transcription Factor Database (Release 7) using BLASTX and the Eukaryotic Promoter Database (Release 35) using TBLASTN. No significant homologies were detected. The level of conservation of these noncoding elements might support a role in the regulation of this gene. Similar levels of homology in noncoding elements have been seen in other genes sequenced in the human and mouse. For example, in the human IL13 gene, a 100-bp region and 288-bp region each with 67% sequence identity to the mouse sequence were observed in introns 2 and 3, respectively (McKenzie et al., 1993). Considerable homology (76%) was also observed in a region 5' of the initiation codon of the gene. Similar levels of conservation (65%) were seen in the 5'-UTR and intron 1 of the human and mouse EPO genes (Galson et al., 1993) and 5' flanking sequences of mouse and human ε and β globin genes (Shehee et al., 1989). Unlike the IL13 and EPO genes, however, the mouse and human XRCC1 genes do not share any significant homologies in the 3'-UTR or in the additional 1091 bp of sequence compared further 3' of the gene.
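As a rough illustration of the comparison drawn from Table 2, the following Python sketch tabulates the published lengths and similarity values and contrasts the XRCC1 exons with the remaining conserved elements. The split into exon and non-exon rows by region label, the use of unweighted means, and all variable and function names are assumptions made for this illustration; they are not part of the original analysis.

```python
from statistics import mean

# Table 2 rows as (element, length_bp, percent_similarity, region).
TABLE_2 = [
    ("CE1", 138, 65, "Putative ORF on minus strand"),
    ("CE2", 271, 76, "Upstream of 5'-UTR"),
    ("CE3", 83, 75, "Upstream of 5'-UTR"),
    ("CE4", 141, 67, "5'-UTR"),
    ("CE5", 128, 78, "5'-UTR"),
    ("CE6", 90, 84, "Exon 1"),
    ("CE7", 121, 89, "Exon 2"),
    ("CE8", 89, 78, "Intron 2 element"),
    ("CE9", 133, 86, "Exon 3"),
    ("CE10", 182, 87, "Exon 4"),
    ("CE11", 84, 79, "Exon 5"),
    ("CE12", 96, 78, "Intron 5"),
    ("CE13", 119, 89, "Exon 6"),
    ("CE14", 124, 81, "Exon 7"),
    ("CE15", 75, 73, "Intron 7"),
    ("CE16", 117, 73, "Exon 8"),
    ("CE17", 82, 65, "Part of intron 8"),
    ("CE18", 293, 78, "Exon 9"),
    ("CE19", 133, 90, "Exon 10"),
    ("CE20", 132, 82, "Exon 11"),
    ("CE21", 151, 81, "Exon 12"),
    ("CE22", 81, 89, "Exon 13"),
    ("CE23", 156, 83, "Exon 14"),
    ("CE24", 113, 79, "Exon 15"),
    ("CE25", 127, 77, "Exon 16"),
    ("CE26", 121, 88, "Exon 17"),
]


def is_exon(region: str) -> bool:
    """Assumption of this sketch: a row is an XRCC1 exon iff its label starts with 'Exon'."""
    return region.startswith("Exon")


exons = [row for row in TABLE_2 if is_exon(row[3])]
others = [row for row in TABLE_2 if not is_exon(row[3])]

# Unweighted mean % similarity per class (element length is ignored here).
exon_mean = mean(sim for _, _, sim, _ in exons)
other_mean = mean(sim for _, _, sim, _ in others)

# Non-exon elements at least as similar as the least-conserved exon.
lowest_exon = min(sim for _, _, sim, _ in exons)
comparable = [name for name, _, sim, _ in others if sim >= lowest_exon]

if __name__ == "__main__":
    print(f"exons: n={len(exons)}, mean similarity={exon_mean:.1f}%")
    print(f"other elements: n={len(others)}, mean similarity={other_mean:.1f}%")
    print("non-exon elements within the exon range:", ", ".join(comparable))
```

Listing the non-exon elements whose similarity reaches that of the least-conserved exon is one simple way to make concrete the observation that some noncoding elements are as highly conserved as the exons.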
None of the remaining conserved introns of the XRCC1 gene was physically near any of the regions conserved between XRCC1 and the yeast rad4 protein (Fenech et al., 1991; Lehmann, 1993). The conserved regions observed in XRCC1 may be important for proper splicing of the transcript to maintain regions of the protein required for proper folding. Although no canonical transcription factor binding motifs were identified in the conserved elements, these elements may still be involved in regulation of expression of this gene. The XRCC1 message is more highly expressed in testis, ovary, and brain than in liver or spleen (Yoo et al., 1992). It has been postulated that XRCC1 may play a role in DNA processing during meiogenesis and recombination in germ cells. Perhaps consensus binding sites for transcription factors involved in regulation of germ line processes are not well represented in the databases.

The comparative genomic sequences described for the human and mouse XRCC1 genes provide the tools for further investigations into gene function and regulation. To date, no human phenotype has been associated with a defect in XRCC1, and the protein exhibits homologies to only short regions of the Schizosaccharomyces pombe rad4 gene (Fenech et al., 1991). Thus, without highly homologous yeast mutants to study, construction of a transgenic mouse model could shed further light on this gene's function and even determine whether this gene is essential to cell viability, as the ERCC1 mouse model demonstrated (McWhir et al., 1993). To this end, sequence data from the mouse cosmid have been used to construct a targeting vector for a knockout of this gene in mouse embryonic stem cells (R. Tebbs, 1994 pers. comm.). The identification of noncoding conserved elements provides a starting point for experiments to elucidate the mechanisms of transcriptional regulation of this gene, as well as provides useful insights into the levels of conservation that we might expect to see in other gene regions in mouse and human.

ACKNOWLEDGMENTS

We thank Dr. Mark Batzer for helpful discussions on Alu sequences, and Dr. Elbert Branscomb for technical assistance with identification of conserved elements. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract W-7405-ENG-48.

REFERENCES

Altschul, S. F., Gish, W., Miller, W., Myers, E. W., and Lipman, D. J. (1990). Basic local alignment search tool. J. Mol. Biol. 215: 403-410.
Barnes, D. E., Kodama, K.-I., Tynan, K., Trask, B. J., Christensen, M., De Jong, P. J., Spurr, N. K., Lindahl, T., and Mohrenweiser, H. W. (1992). Assignment of the gene encoding DNA ligase I to human chromosome 19q13.2-13.3. Genomics 12: 164-166.
Batzer, M. A., Kilroy, G. E., Richard, P. E., Shaikh, T. H., Desselle, T. D., Hoppens, C. L., and Deininger, P. L. (1990). Structure and variability of recently inserted Alu family members. Nucleic Acids Res. 18: 6793-6798.
Bennett, K. L., Hill, R. E., Pietras, D. F., Woodworth-Gutal, M., Kane-Haas, C., Houston, J. M., Heath, J. K., and Hastie, N. D. (1984). Most highly repeated dispersed DNA families in the mouse genome. Mol. Cell. Biol. 4: 1561-1571.
Brookman, K. W., Tebbs, R. S., Allen, S., Tucker, J. D., Swiger, R. R., Lamerdin, J. E., Carrano, A. V., and Thompson, L. H. (1994). Isolation and characterization of mouse Xrcc-1, a DNA repair gene affecting ligation. Genomics 22: 180-188.
Caldecott, K. W., McKeown, C. K., Tucker, J. D., Ljungquist, S., and Thompson, L. H. (1994). An interaction between the mammalian DNA repair protein XRCC1 and DNA ligase III. Mol. Cell. Biol. 14: 68-76.
Cleaver, J. E. (1994). It was a very good year for DNA repair. Cell 76: 1-4.
Das, H. K., Jackson, C. L., Miller, D. A., Leff, T., and Breslow, J. L. (1987). The human apolipoprotein C-II gene sequence contains a novel chromosome 19-specific minisatellite in its third intron. J. Biol. Chem. 262: 4787-4793.
de Jong, P. J., Yokobata, K., Chen, C., Lohman, F., Pederson, L., McNinch, J., and Van Dilla, M. (1989). Human chromosome-specific partial digest libraries in λ and cosmid vectors. Cytogenet. Cell Genet. 51: 985.
Fenech, M., Carr, A. M., Murray, J., Watts, F. Z., and Lehmann, A. R. (1991). Cloning and characterization of the rad4 gene of Schizosaccharomyces pombe: A gene showing short regions of sequence similarity to the human XRCC1 gene. Nucleic Acids Res. 19: 6737-6741.
Galson, D. L., Tan, C. C., Ratcliffe, P. J., and Bunn, H. F. (1993). Comparison of the human and mouse erythropoietin genes shows extensive homology in the flanking regions. Blood 82: 3321-3326.
Gray, J. W., Langlois, R. G., Carrano, A. V., Burkhart-Schultz, K., and Van Dilla, M. A. (1979). High resolution chromosome analysis: One and two parameter flow cytometry. Chromosoma 73: 9-27.
Hardison, R., and Miller, W. (1993). Use of long sequence alignments to study the evolution and regulation of mammalian globin gene clusters. Mol. Biol. Evol. 10: 73-102.
Hellman-Blumberg, U., McCarthy Hintz, M. F., Gatewood, J. M., and Schmid, C. W. (1993). Developmental differences in methylation of human Alu repeats. Mol. Cell. Biol. 13: 4523-4530.
Iris, F. J. M., Bougueleret, L., Prieur, S., Caterina, D., Primas, G., Perrot, V., Jurka, J., Rodriguez-Tome, P., Claverie, J. M., Dausset, J., and Cohen, D. (1993). Dense Alu clustering and a potential new member of the NFKB family within a 90 kilobase HLA Class III segment. Nature Genet. 3: 137-145.
Jurka, J., Walichiewicz, J., and Milosavljevic, A. (1992). Prototypic sequences for human repetitive DNA. J. Mol. Evol. 35: 286-291.
Langlois, R. G., Yu, L.-C., Gray, J. W., and Carrano, A. V. (1982). Quantitative karyotyping of human chromosomes by dual beam flow cytometry. Proc. Natl. Acad. Sci. USA 79: 7876-7880.
Lehmann, A. R. (1993). Duplicated region of sequence similarity to the human XRCC1 DNA repair gene in the Schizosaccharomyces pombe rad4/cut4 gene. Nucleic Acids Res. 21: 5274.
Martin-Gallardo, A., McCombie, W. R., Gocayne, J. D., FitzGerald, M. G., Wallace, S., Lee, B. M. B., Lamerdin, J., Trapp, S., Kelley, J. M., Liu, L.-I., Dubnick, M., Johnston-Dow, L. A., Kerlavage, A. R., De Jong, P., Carrano, A., Fields, C., and Venter, J. C. (1992). Automated DNA sequencing and analysis of 106 kilobases from human chromosome 19q13.3. Nature Genet. 1: 34-39.
Martin-Gallardo, A., Lamerdin, J., and Carrano, A. V. (1993). Shotgun sequencing. In "Automated DNA Sequencing and Analysis" (M. Adams, C. Fields, and J. C. Venter, Eds.), pp. 37-41, Academic Press, London.
McCombie, W. R., Kirkness, E., Fleming, J. T., Kerlavage, A. R., Iovannisci, D. M., and Martin-Gallardo, A. (1991). The use of exonuclease III deletions in automated DNA sequencing. Methods Enzymol. 3: 33-40.
McKenzie, A. N. J., Li, X., Largaespada, D. A., Sato, A., Kaneda, A., Zurawski, S. M., Doyle, E. L., Milatovich, A., Francke, U., Copeland, N. G., Jenkins, N. A., and Zurawski, G. (1993). Structural comparison and chromosomal localization of the human and mouse IL-13 genes. J. Immunol. 150: 5436-5444.
McWhir, J., Selfridge, J., Harrison, D. J., Squires, S., and Melton, D. W. (1993). Mice with DNA repair gene (ERCC-1) deficiency have elevated levels of p53, liver nuclear abnormalities and die before weaning. Nature Genet. 5: 217-224.
Mohrenweiser, H. W., Carrano, A. V., Fertitta, A., Perry, B., Thompson, L. H., Tucker, J. D., and Weber, C. A. (1989). Refined mapping of the three DNA repair genes, ERCC1, ERCC2, and XRCC1, on human chromosome 19. Cytogenet. Cell Genet. 52: 11-14.
Moyzis, R. K., Torney, D. C., Meyne, J., Buckingham, J. M., Wu, J.-R., Burks, C., Sirotkin, K. M., and Goad, W. B. (1989). The distribution of interspersed repetitive DNA sequences in the human genome. Genomics 4: 273-289.
Nelson, M., and McClelland, M. (1991). Site specific methylation: Effect on DNA modification methyltransferases and restriction endonucleases. Nucleic Acids Res. 19: 2045-2071.
Shehee, W. R., Loeb, D. D., Adey, N. B., Burton, F. H., Casavant, N. C., Cole, P., Davies, C. J., McGraw, R. A., Schichman, S. A., Severynse, D. M., Voliva, C. F., Weyter, F. W., Wisely, G. B., Edgell, M. H., and Hutchison, C. A., III (1989). Nucleotide sequence of the BALB/c mouse β-globin complex. J. Mol. Biol. 205: 41-62.
Stormo, G. D. (1987). Identifying coding sequences. In "Nucleic Acid and Protein Sequence Analysis: A Practical Approach" (M. J. Bishop and C. J. Rawlings, Eds.), pp. 231-258, IRL Press, Oxford.
Thompson, L. H., Brookman, K. W., Dillehay, L. E., Carrano, A. V., Mazrimas, J. A., Mooney, C. L., and Minkler, J. L. (1982). A CHO-cell strain having hypersensitivity to mutagens, a defect in strand-break repair, and an extraordinary baseline frequency of sister chromatid exchange. Mutat. Res. 95: 427-440.

Thompson, L. H., Mooney, C. L., Burkhart-Schultz, K., Carrano, A. V., and Siciliano, M. J. (1985). Correction of a nucleotide excision repair mutation by human chromosome 19 in hamster-human hybrid cell lines. Somatic Cell Mol. Genet. 11: 87-92.
Thompson, L. H., Brookman, K. W., Jones, N. J., Allen, S. A., and Carrano, A. V. (1990). Molecular cloning of the human XRCC1 gene, which corrects defective DNA strand break repair and sister chromatid exchange. Mol. Cell. Biol. 10: 6160-6171.
Trask, B., Fertitta, A., Christensen, M., Youngblom, J., Bergmann, A., Copeland, A., De Jong, P., Mohrenweiser, H., Olsen, A., Carrano, A., and Tynan, K. (1993). Fluorescence in situ hybridization mapping of human chromosome 19: Cytogenetic band location of 540 cosmids and 70 genes or DNA markers. Genomics 15: 133-145.
Uberbacher, E. C., and Mural, R. J. (1991). Locating protein-coding regions in human DNA sequences by a multiple sensor-neural network approach. Proc. Natl. Acad. Sci. USA 88: 11261-11265.
Van Duin, M., van den Tol, J., Warmerdam, P., Odijk, H., Westerveld, A., Bootsma, D., and Hoeijmakers, J. H. J. (1988). Evolution and mutagenesis of the mammalian excision repair gene ERCC-1. Nucleic Acids Res. 16: 5305-5322.
Vivaud, D., Vivaud, M., Siguret, V., Sanchez, S. G., Laurian, Y., Meyer, D., Goossens, M., and Lavergne, J. M. (1993). Haemophilia B due to a de novo insertion of a human-specific Alu subfamily member within the coding region of the Factor IX gene. Eur. J. Hum. Genet. 1: 30-36.
Wallace, M. R., Andersen, L. B., Saulino, A. M., Gregory, P. E., Glover, T. W., and Collins, F. S. (1991). A de novo Alu insertion results in neurofibromatosis type 1. Nature 353: 864-866.
Yoo, H., Li, L., Sacks, P. G., Thompson, L. H., Becker, F. F., and Chan, J. Y.-H. (1992). Alterations in expression and structure of the DNA repair gene XRCC1. Biochem. Biophys. Res. Commun. 186: 900-910.