The sequence of the chromosomal mouse β-globin major gene: Homologies in capping, splicing and poly(A) sites

Cell, Vol. 15, 1125-1132, December 1978, Copyright © 1978 by MIT


David A. Konkel, Shirley M. Tilghman* and Philip Leder
Laboratory of Molecular Genetics
National Institute of Child Health and Human Development
Bethesda, Maryland 20014

Summary

We have determined the entire nucleotide sequence of a cloned β-globin major (βmaj) gene derived from the BALB/c mouse. This sequence is 1567 bases long and includes the 5’ cap region as well as the presumptive poly(A) addition site of β-globin mRNA. The sequence establishes that the gene is encoded in three discontinuous segments of DNA interrupted by two intervening sequences, and precisely locates each. The smaller intervening sequence, 116 bases long, occurs between Arg and Leu codons at codon positions 30 and 31. The larger intervening sequence of 646 bases also occurs between Arg and Leu codons, but at codon positions 104 and 105. There is striking homology between the borders of the two intervening sequences, but no extensive dyad symmetry. Furthermore, the DNA region that just precedes and overlaps the 5’ cap structure of the mRNA shows homology to corresponding regions in other eucaryotic genes, including the late adenovirus promoter. The 3’ untranslated sequence is closely homologous to that of the rabbit β-globin mRNA. The sequence thus allows us to identify several noncoding regions of potential importance for the expression and processing of genetic information. It also provides a basis for future comparison with other sequenced genes and a defined substrate for the development of direct tests of gene function.

The adult BALB/c mouse coordinately expresses two β-globin genes encoding the two polypeptides β-globin major (βmaj) and β-globin minor (βmin). These genes reside on the two Eco RI fragments of genomic DNA that have been cloned in the bacteriophage λ (Tilghman et al., 1977; Tiemeier et al., 1978). Both genes appear to be interrupted by two intervening sequences of DNA (ibid.) that are transcribed into the 15S globin mRNA precursor (Smith and Lingrel, 1978; Kinniburgh, Mertz and Ross, 1978; Tilghman et al., 1978b). Such structures have also been found in the rabbit β-globin genes (Jeffreys and Flavell, 1977) and in the genes of Drosophila rDNA (Glover and Hogness, 1977; Wellauer and Dawid, 1977; White and Hogness, 1977), adenovirus (Berget, Moore and Sharp, 1977; Kitchingman, Lai and Westphal, 1977; Chow et al., 1977), SV40 (Aloni et al., 1977), yeast tRNA (Goodman, Olson and Hall, 1977; Valenzuela et al., 1978), mouse immunoglobulin (Brack and Tonegawa, 1977) and chicken ovalbumin (Breathnach, Mandel and Chambon, 1977; Doel et al., 1977; Lai et al., 1978; Weinstock et al., 1978). They apparently represent a general feature of eucaryotic gene organization and define an additional step required for the correct processing of genetic information: the deletion of intervening sequences from RNA precursors and the precise splicing of internal segments to form the mature RNA species (Knapp et al., 1978; O’Farrell et al., 1978).

The mouse β-globin system provides a useful evolutionary model for understanding the structural requirements for this reaction and for the regulation of coordinately expressed genes. As visualized by heteroduplex analysis, both genes are embedded in nonhomologous segments of genomic DNA, but appear to have preserved homology (in addition to their coding sequences) in the few hundred bases bordering the structural genes and their intervening sequences (Tiemeier et al., 1978). Such comparisons suggest an essential role for these preserved sequences in both the joint expression of these genes and the processing of their transcripts.

We now report the complete nucleotide sequence of the cloned mouse βmaj gene. The sequence allows us not only to identify the gene and precisely locate the two intervening sequences that interrupt it, but also to identify several interesting structural features of the interrupted and joined regions and of the regions surrounding the 5’ cap structure and the 3’ poly(A) terminus of the mRNA.

* Present address: The Fels Research Institute, Temple University School of Medicine, 3420 N. Broad Street, Philadelphia, Pennsylvania 19140.

Results and Discussion

Sequencing Strategy, Procedure and Accuracy

The 7 kb Eco RI DNA fragment containing the globin structural gene was cloned from BALB/c mouse as λgtWESMβG2 (Tilghman et al., 1977) and is represented in Figure 1. To expedite fragment preparation and reduce restriction endonuclease consumption, the entire RI fragment was subcloned into pMB9 (Bolivar et al., 1977a), while the two Hind III-Bam fragments were subcloned into pBR322 (Bolivar et al., 1977b) (see Figure 1). A detailed restriction map (4 and 5 base recognition enzymes) was developed for the 800 bp (5’) Hind III-Bam fragment and the 1500 bp (3’) Bam-Xba fragment using the partial restriction mapping techniques of Smith and Birnstiel (1976) (see Figure 1). To simplify the restriction pattern, the sequence was developed starting with these two fragments using the partial chemical degradation method of Maxam and Gilbert (1977) and, in some cases, the thin polyacrylamide gel system of Sanger and Coulson (1978) to increase resolution. Regions were sequenced in both 5’ → 3’ and 3’ → 5’ directions wherever feasible, especially in the intervening sequences, where there is no amino acid sequence with which to correlate. The cleavages used in determining the nucleotide sequence are shown in Figure 1.

Figure 1. Strategy for Sequencing the Mouse β-Globin Major Gene

The upper panel represents the cloned Eco RI fragment, with a magnified detail of the ~2.3 kb Hind III-Xba fragment enclosing the entire globin structural gene. The sequence is measured in centabases (cb) in the 5’ to 3’ direction relative to the coding sequence. The cleavages used in developing the sequence are represented below. The Hind III-Bam (5’) or Bam-Xba (3’) fragments were cleaved and kinased at the sites indicated by the base of each horizontal arrow. The kinased fragment was then recleaved at the restriction endonuclease site shown by the point of each arrow. The resulting fragment was then sequenced by the Maxam-Gilbert (1977) technique and read as far as the box in the indicated direction.

In a long nucleotide sequence, it is important to assess the accuracy of the sequence determined. Ideally, it should be 100%; however, this degree of accuracy is seldom attained. The distribution of restriction sites dictates the length of the fragment to be sequenced; maximum accuracy is in the region between 10 and 140 bases away from the kinased end if thin gels (0.3 mm) are used, or between 10 and 60 for thick gels (1.5 mm). We have found that the accuracy of a sequence run only in one direction on thick gels may be as low as 65% if there is no corresponding amino acid sequence to guide difficult choices. The usual mistakes are incorrect C versus T assignments or reading fewer bases in a homogeneous run than are actually present. The accuracy of a region sequenced in both directions on thick gels, or in one direction on thin gels, is about 99%; however, this translates to 15 errors in a 1500 bp sequence. To increase the accuracy above this level would generally not be worthwhile, but if regions of interest were in homologous sequences which differed in only a few bases, it would be prudent to verify both by resequencing.
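The relationship between per-base accuracy and the number of errors quoted in the accuracy discussion is simple arithmetic. The following is a minimal Python sketch of it; the function name `expected_errors` is hypothetical, and the calculation assumes that errors are independent and evenly spread along the sequence, which the text does not claim.

```python
def expected_errors(length_bp: int, accuracy: float) -> float:
    """Expected number of miscalled bases for a given per-base accuracy.

    Assumption (not from the paper): every base is miscalled independently
    with probability 1 - accuracy.
    """
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must lie between 0 and 1")
    return length_bp * (1.0 - accuracy)


# 99% accuracy over a 1500 bp sequence, the case quoted in the text.
print(round(expected_errors(1500, 0.99)))
# 65% accuracy (single direction, thick gel, no amino acid guide).
print(round(expected_errors(1500, 0.65)))
```

Under these assumptions, the gap between the 65% and 99% cases shows why sequencing both strands, or using thin gels, matters most in the intervening sequences.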

Nucleotide Sequence of the βmaj Gene

The complete sequence of the mouse β-globin major gene is shown in Figure 2. The sequence determined begins 70 nucleotides before the 5’ cap site (see below) and extends 231 bases beyond the 3’ translational termination codon of β-globin mRNA. Coding sequences can be correlated with the amino acid sequence already determined for the BALB/c globin βmaj polypeptide by Gilman (1972) and Popp and Bailiff (1973). [These amino acid determinations disagreed at two of 146 positions. Our sequence agrees with that of Gilman (1972) at the disputed positions.] Since there are six amino acid sequence differences between βmaj and βmin, we can now identify the gene cloned within the 7 kb Eco RI fragment (λgtWESMβG2) as βmaj. The obvious expectation is that the BALB/c β-globin gene encoded in a second, 14 kb Eco RI fragment (λgtWESMβG3) corresponds to βmin (Tiemeier et al., 1978). Partial nucleotide sequencing (D.A.K., unpublished data) indicates that this is indeed the case.

As suggested by electron microscopic analysis (Tilghman et al., 1978a), the βmaj gene is interrupted by two intervening sequences of DNA dividing the coding sequence into three discontinuous blocks. The evolutionary significance of these separated blocks and the unique intervening sequences that divide them is still a matter for speculation (Miller, Konkel and Leder, 1978). The first block encodes amino acids 1-30 (or 29, see below); the second, 30-104; and the third, 105-146. It is difficult to assign segregated functional or structural properties to the portions of the globin polypeptide chain encoded in each block. Each polypeptide segment contains helical regions interrupted by nonhelical portions, and each segment, arguing by analogy to human variants, accumulates mutations. The two histidines that coordinate heme are located at positions 63 and 92 and reside within the second coding block, while all the amino acid differences between βmaj and βmin occur in the first two blocks. Although the central portion provides the heme binding center, there is no evidence of its independent evolution as a heme binding peptide. Indeed, the relationship between the globins and myoglobins suggests that all the coding blocks have evolved together.

When the genomic globin sequence is compared with that of the 5’ untranslated region of β-globin mRNA as determined by Baralle and Brownlee (1978), it forms a continuous and perfect match over the 52 nucleotides available for comparison. The βmaj gene thus lacks the spliced-out leader sequences that occur between the cap structure and the initiator ATG of certain viral genes (Aloni et al., 1977; Berget et al., 1977; Chow et al., 1977) and the chicken ovalbumin gene (Breathnach et al., 1978; Dugaiczyk et al., 1978). All interruptions of the globin gene appear to occur within the coding segments.

Figure 2. Nucleotide Sequence of the Mouse βmaj Gene

[Only positions 1-700 of the 1567 base sequence survive in this copy. The CAP, IVS1, IVS2, Ter and PA markers of the original figure are not reproduced. In the rows below, codon 30 (Arg) ends at position 223, codon 31 (Leu) begins at position 340, and codon 104 (Arg) ends at position 561.]

    1  GGCCAATCTGCTCACACAGGATAGAGAGGGCAGGAGCCAGGCAGAGCATATAAGGTGAGGTAGGATCAGTTGCTCCTCACATTTGCTTCTGACATAGTTG
  101  TGTTGACTCACAACCCCAGAAACAGACATCATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGTCTCTTGCCTGTGGGGAAAGGTGAACTCCGATGAAG
                                     MetValHisLeuThrAspAlaGluLysAlaAlaValSerCysLeuTrpGlyLysValAsnSerAspGluV
  201  TTGGTGGTGAGGCCCTGGGCAGGTTGGTATCCAGGTTACAAGGCAGCTCACAAGAAGAAGTTGGGTGCTTGGAGACAGAGGTCTGCTTTCCAGCAGACAC
       alGlyGlyGluAlaLeuGlyArg
  301  TAACTTTCAGTGTCCCCTGTCTATGTTTCCCTTTTTAGGCTGCTGGTTGTCTACCCTTGGACCCAGCGGTACTTTGATAGCTTTGGAGACCTATCCTCTG
                                              LeuLeuValValTyrProTrpThrGlnArgTyrPheAspSerPheGlyAspLeuSerSerA
  401  CCTCTGCTATCATGGGTAATGCCAAAGTGAAGGCCCATGGCAAGAAGGTGATAACTGCCTTTAACGATGGCCTGAATCACTTGGACAGCCTCAAGGGCAC
       laSerAlaIleMetGlyAsnAlaLysValLysAlaHisGlyLysLysValIleThrAlaPheAsnAspGlyLeuAsnHisLeuAspSerLeuLysGlyTh
  501  CTTTGCCAGCCTCAGTGAGCTCCACTGTGACAAGCTGCATGTGGATCCTGAGAACTTCAGGGTGAGTCTGATGGGCACCTCCTGGGTTTCCTTCCCCTGC
       rPheAlaSerLeuSerGluLeuHisCysAspLysLeuHisValAspProGluAsnPheArg
  601  TATTCTGCTCAACCTTCCTATCAGAAAAAAAGGGGAAGCGATTCTAGGGAGCAGTCTCCATGACTGTGTGTGGAGTGTTGACAAGAGTTCGGATATTTTA

The nucleotide sequence (of the strand corresponding to the mRNA) is displayed 5’ to 3’ on the top line. The amino acid sequence is displayed on the line below the coding sequence and corresponds to that derived from the determinations of Gilman (1972) and Popp and Bailiff (1973). The symbol “CAP” represents the start of the capped mRNA as determined by Baralle and Brownlee (1978). The numbers bordering the sequence reference nucleotide position. The numbers inside the sequence reference adjacent amino acid positions. The initial Met is not associated with the final polypeptide product. Ter refers to the termination codon UAA. The symbol “PA” represents the putative site of poly(A) addition as established by comparison to the rabbit β-globin mRNA sequence (Efstratiadis et al., 1977; see also Figure 6).

The entire sequence has also been subjected to computer analysis using the program of Korn, Queen and Wegman (1977) to search for unusual codon selections, symmetries, repeats and homologies. The codon usage frequency follows that seen with rabbit and human β-globin (Efstratiadis, Kafatos and Maniatis, 1977; Kafatos et al., 1977). Codons ending in A are grossly underutilized (5% of the total), while the Leu codon CTG, the Arg codon AGG and the Val codon GTG are greatly preferred over others for the same residue. As with most eucaryotic sequences, the dinucleotide CG is also grossly underrepresented, comprising only 10% of the frequency predicted from the nucleotide distribution. The three coding blocks have an overall GC content of 58, 52 and 58% (5’ to 3’), while the two intervening sequences have a GC content of 48 and 41% (5’ to 3’). The 5’ and 3’ flanking sequences, including the transcribed-untranslated regions, are 47 and 37% GC, respectively. Thus this gene consists of translated regions in which GC predominates, embedded in and interrupted by regions of relatively lower GC content. As more genomic sequences become available, it will be interesting to see whether this proves to be a common pattern.
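The base composition figures discussed above can be reproduced on the portion of the sequence that survives in Figure 2. The Python sketch below is illustrative only. The function names are hypothetical, `EXON1` is taken as positions 131-223 (the initiator ATG through codon 30) and `IVS1` as positions 224-339, and the "predicted" CG frequency is assumed to be the product of the C and G frequencies, which is one common convention and may not be the exact method of the original computer analysis.

```python
from collections import Counter


def gc_fraction(seq: str) -> float:
    """Fraction of bases that are G or C."""
    counts = Counter(seq.upper())
    return (counts["G"] + counts["C"]) / len(seq)


def cpg_obs_over_exp(seq: str) -> float:
    """Observed CG dinucleotide count divided by the count expected from
    the C and G frequencies alone (assumed definition)."""
    s = seq.upper()
    observed = sum(1 for i in range(len(s) - 1) if s[i:i + 2] == "CG")
    expected = s.count("C") * s.count("G") / len(s)
    return observed / expected if expected else float("nan")


# First coding block, positions 131-223 of Figure 2 (ATG through codon 30).
EXON1 = ("ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGTCTCTTGCCTGTGGGG"
         "AAAGGTGAACTCCGATGAAGTTGGTGGTGAGGCCCTGGGCAGG")

# Small intervening sequence, positions 224-339 of Figure 2.
IVS1 = ("TTGGTATCCAGGTTACAAGGCAGCTCACAAGAAGAAGTTGGGTGCTTGGAGACAGAG"
        "GTCTGCTTTCCAGCAGACACTAACTTTCAGTGTCCCCTGTCTATGTTTCCCTTTTTAGG")

for name, seq in (("coding block 1", EXON1), ("IVS1", IVS1)):
    print(f"{name}: {len(seq)} bases, GC = {gc_fraction(seq):.1%}, "
          f"CG obs/exp = {cpg_obs_over_exp(seq):.2f}")
```

On this transcription the first coding block comes out at about 58% GC and the small intervening sequence slightly below the reported 48%; a difference of that size could reflect a transcription error in this copy or a different definition of the region boundaries.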

The Two Intervening Sequences Have Virtually Identical Joint Sequences

Apart from questions of evolutionary and physiological significance, the presence of both globin intervening sequences in the 15S globin mRNA precursor (Smith and Lingrel, 1978; Kinniburgh et al., 1978; Tilghman et al., 1978b) poses the immediate problem of understanding the structural and enzymatic basis for deleting these sequences and precisely rejoining the resulting internal segments of mRNA. Unlike the case of the small intervening sequences that occur in certain yeast tRNAs (O’Farrell et al., 1978; Valenzuela et al., 1978), a computer-assisted search reveals no convincingly stable dyad symmetries that would draw both intervening sequence joints into the stem or stem-and-loop arrangement which one might have expected. We have also searched for segments elsewhere in the flanking, coding and intervening sequences that are complementary to the intervening sequence borders and might facilitate splicing by drawing these borders together. While there are several regions of limited complementarity, there is no single region that seems capable of performing this function.

Instead, there is a striking sequence homology in the regions that border both intervening sequences (Figure 3A). An 8 (of 10) base homology occurs between the two 5’ coding-intervening sequence borders and another 8 (of 10) base homology occurs between the two 3’ coding-intervening sequence borders. Although there is some choice in the splicing frame used to join coding sequences (see below), when the joints between the two intervening sequences are spliced, both can form sequences identical at 10 of 11 base positions (CAGGCU(G/C)CUGG) that encode the tripeptide Arg Leu Leu repeated at positions 30-32 and 104-106 in the β-globin chain (Figure 3B).

Figure 3. Comparison of the Junctions of the Two βmaj Intervening Sequences and of Spliced Sequences in βmaj mRNA

(A) Numbers in the figure refer to amino acid codon positions. Sequences are drawn from positions 220-228, 337-346, 558-567 and 1204-1212 in Figure 2. IVS1 is the small intervening sequence; IVS2 is the large intervening sequence. (B) The two sequences are formed by joining the coding sequences specifically between the 5’ AGG codon and the first Leu codon. As discussed in the text, other splicing frames are possible, preserving the coding specification (see Figure 4). The two sequences are identical at 10/11 positions.

That both β-globin intervening sequences have virtually identical structures at their borders and in the final spliced sequence is likely to be of significance in their joining. Indeed, a comparison of these regions in the mouse sequences thus far determined reveals a conservation of sequence, especially at the 5’ border (Table 1). Although the sequence information is obviously limited, a tentative ancestral sequence can be derived (TCAGGT at the 5’ border, CAGG at the 3’ border) that fits well with the rabbit β-globin (van den Berg et al., 1978; A. Efstratiadis, E. Lacy and T. Maniatis, personal communication), the chicken ovalbumin (Breathnach et al., 1978; Catterall et al., 1978) and the putative silk fibroin (Y. Suzuki, personal communication) border regions. Such conservation suggests the possible existence of a common splicing mechanism among organisms that diverged hundreds of millions of years ago. Nevertheless, this small core segment of homology probably does not constitute a sufficient signal for processing. There are several CAGG-containing sites in the coding and intervening sequences with >75% homology to the actual joint regions.

Interestingly, a comparison of the β-globin sequences of mouse, rabbit and human mRNAs (Kafatos et al., 1977) reveals that nucleotide substitutions, both silent and codon-altering, are not randomly distributed. In particular, the coding regions surrounding the two intervening sequences are closely preserved and appear to constitute a “cold spot” subject to special constraints. (These regions, 27 bp on either side of IVS1, and the 39 bp preceding and the 11 bp following IVS2, are >98% homologous among these species, whereas the overall coding sequence homology is only 80%.)

Given the impressive homology between regions that border both intervening sequences, the possibility that the first coding block will occasionally be joined to the last coding block must be considered (that is, codon 30 would be adjacent to codon 105). Whether this ever occurs, resulting in a short mRNA encoding a protein half the size of β-globin, is not known. It is probable that globin precursor RNA has a higher order structure, specified by the “cold spot” noted above, that constrains the permitted combinations for the excision-rejoining reaction. In a large mRNA with many intervening sequences, any one of which could create missense, a high degree of precision is obviously required.

Table 1. Comparison of Globin and Immunoglobulin Coding:Intervening Sequence Borders

Gene                 IVS     5’ border     3’ border
Mouse βmaj           IVS1    GCAGGT-TG     TTAGGCTGCTG
Mouse βmaj           IVS2    TCAGGGTG      ACAG-CTCCTG
Mouse βmin           IVS2    TCAGGGTG      ?
Mouse α              IVS2    TCAGGTAT      ?
Mouse λ1             IVS1    TCAGGTCA      GCAGGGGCCA
Mouse λ1             IVS2    CTAGGTGA      CCTGCGGCCA
Mouse λ2             IVS1    TCAGGTCA      GCAGGAGCCA
Prevalent sequence           TCAGGT        CAGG

The table aligns the borders of the known mouse intervening sequences to show maximum homology. The abbreviations IVS1 and IVS2 refer to the first and second intervening sequences from 5’ to 3’ relative to the mRNA sequence. The mouse βmin sequence was determined by D.A.K. (unpublished data), the mouse α sequence was determined by D. Hamer (Leder et al., 1978), the mouse λ1 sequences were determined by S. Tonegawa (personal communication) and the λ2 sequence by Tonegawa et al. (1978).

Possible Splicing Frames

It is clear that the integrity of β-globin and the nature of the genetic code require a precise joining of internal coding segments of globin mRNA. A certain tolerance, however, may be built into the joint sequence. A splice to remove the first intervening sequence can be made in any of seven frames and yet preserve the proper amino acid coding sequence (Figure 4). The three possible 3’ frames involve using a Leu codon (UUG) different from that generated by a splice occurring in one of the four 5’ frames (CUG). This flexibility derives in part from the repetition of an Arg codon (AGG) at both 3’ and 5’ intervening sequence joints, a feature also seen in rabbit globin (van den Berg et al., 1978; A. Efstratiadis, E. Lacy and T. Maniatis, personal communication). The splice required to remove the second intervening sequence is more restrictive, allowing only two cutting frames.

Characteristics of the Intervening Sequences

The two intervening sequences differ in both length (116 versus 646 bp) and overall nucleotide sequence. The larger intervening sequence is particularly AT-rich (58%), with Ts predominating (38%) on the strand corresponding to the precursor sequence (Figure 2). These occur in repeated runs that would principally intersperse the larger intervening sequence of the precursor with oligo(U)s. The smaller intervening sequence has a more heterogeneous base composition and distribution, suggesting that the gross features of its higher ordered structures would differ from those of the larger. While the evolutionary relationship between the two sequences is unclear, the computer analysis shows many short (8-20 bp) imperfect homologies (>75%) between the two regions that cover 80% of the small intervening sequence and 60% of the larger.
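The seven splicing frames described for the first intervening sequence can be checked directly against the Figure 2 sequence. The Python sketch below is one interpretation rather than the authors' procedure: it assumes that each frame corresponds to sliding a fixed 116 base excision window by -3 to +3 positions relative to the boundary that follows the AGG codon, and the names `SEGMENT`, `splice` and `translate` are hypothetical. The segment spans positions 203-348 (codons 24-30, the intervening sequence, and codons 31-33).

```python
# Standard genetic code, with bases ordered T, C, A, G at each codon position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def translate(dna: str) -> str:
    """Translate complete codons from the first base (one-letter codes)."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


EXON1_TAIL = "GGTGGTGAGGCCCTGGGCAGG"   # codons 24-30, positions 203-223
IVS1 = ("TTGGTATCCAGGTTACAAGGCAGCTCACAAGAAGAAGTTGGGTGCTTGGAGACAGAG"
        "GTCTGCTTTCCAGCAGACACTAACTTTCAGTGTCCCCTGTCTATGTTTCCCTTTTTAGG")
EXON2_HEAD = "CTGCTGGTT"              # codons 31-33, positions 340-348

SEGMENT = EXON1_TAIL + IVS1 + EXON2_HEAD
IVS_START = len(EXON1_TAIL)           # index of the first IVS1 base
IVS_LEN = len(IVS1)                   # 116 bases


def splice(shift: int) -> str:
    """Excise IVS_LEN bases starting `shift` bases from the nominal boundary."""
    cut = IVS_START + shift
    return SEGMENT[:cut] + SEGMENT[cut + IVS_LEN:]


REFERENCE = translate(splice(0))      # Gly Gly Glu Ala Leu Gly Arg Leu Leu Val

for shift in range(-3, 4):
    mrna = splice(shift)
    leu31 = mrna[IVS_START:IVS_START + 3]   # codon that follows Arg 30
    print(f"shift {shift:+d}: Leu 31 codon {leu31}, "
          f"protein preserved: {translate(mrna) == REFERENCE}")
```

Under this interpretation, the four frames that fall within the repeated AGG give the CTG (CUG) codon for Leu 31, and the three frames shifted toward the 3’ side give TTG (UUG), as the text describes.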

Figure 4. Possible Splicing Frames Excising Small and Large Intervening Sequences of Mouse β-Globin Major

IVS1   5’ border: CTG GGC AGG TTG GTA (Leu 28, Gly 29, Arg 30) ... 3’ border: CTT TTT AGG CTG CTG (Arg 30, Leu 31, Leu 32)
IVS2   5’ border: AAC TTC AGG GTG AGT (Asn 102, Phe 103, Arg 104) ... 3’ border: TTC CCA CAG CTC CTG (Leu 105, Leu 106)

Numbers refer to amino acid positions as shown in Figures 2 and 3. IVS1 and IVS2 are the small and large intervening sequences, respectively. The overlapping lines represent possible splicing frames preserving the known amino acid sequence.

Both intervening sequences contain initiation (ATG) and termination codons (TAA, TGA, TAG). The smaller intervening sequence has only one initiator and two terminators, neither of which is in phase with the globin coding sequences. The translation of an unprocessed precursor would produce a polypeptide 99 amino acids long, terminated out of phase within the second coding block at nucleotide positions 428-430 (Figure 2). The larger intervening sequence contains eight initiation and 28 termination codons. Both are present in all three phases and several are in phase with one another so as to encode potential polypeptides. If the smaller intervening sequence were excised and the larger were not, the resulting RNA would encode a polypeptide 192 amino acids long, containing the initial two thirds of the globin polypeptide and continuing until terminated within the larger intervening sequence at nucleotide positions 829-831 (Figure 2).

Identification of Capping Site Homologies and Potential Promoter Regions

We have noted above that the βmaj gene encodes a continuous sequence of 52 bases corresponding to the capped 5’ portion of the βmaj mRNA sequence (Baralle and Brownlee, 1978). In fact, both βmaj and βmin genes display close homology in this region (Tiemeier et al., 1978; also Figure 5). Ziff and Evans (1978) have recently determined the structure of the 5’ terminal capped oligonucleotide of several adenovirus late mRNAs and have shown that this sequence is identical to a region of the adenovirus genome containing the late promoter. Their structure contains a 10 (of 12) base homology to the cap region of βmaj (Figure 5), a 9 (of 12) base homology to the cap region of βmin and an 8 (of 12) base homology to the cap region of rabbit β-globin (van den Berg et al., 1978; A. Efstratiadis, E. Lacy and T. Maniatis, personal communication). While the 5’ sequence of mouse λ light chain (immunoglobulin) mRNA is not known, a variable region gene cloned by Tonegawa et al. (1978) also contains an untranslated sequence 34 nucleotides 5’ to the initiator codon that is homologous to the cap region of βmaj in 8 (of 12) positions and to the cap region of rabbit β-globin in 10 (of 12) positions.

While the globin promoter has not yet been identified, the 15S globin mRNA precursor contains a capped oligonucleotide that cannot be distinguished chromatographically from that of 10S globin mRNA (Curtis et al., 1977). This precursor also forms a smooth R loop with the βmaj gene (Tilghman et al., 1978b). Since the 15S RNA is the longest globin precursor detected (Ross, 1976; Curtis and Weissmann, 1976; Kwan, Wood and Lingrel, 1977; Haynes et al., 1978), it is possible, although by no means demonstrated, that its 5’ sequence corresponds to the initially transcribed sequence. It is interesting to note that the bulk of the homology in this capping region is upstream of the actual cap site. If the cap is attached to the pppA resulting from the transcriptional initiation event, then this highly conserved region upstream of the initiation site might represent the polymerase binding site, possibly a common promoter signal for the transcription of mammalian unique-copy genes.
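The read-through product of an unprocessed precursor, described in the discussion of the intervening sequences, can be checked on the Figure 2 sequence. The Python sketch below translates the unspliced genomic sequence from the initiator ATG at position 131 until the first in-frame termination codon. The names `GENOMIC_FROM_ATG` and `read_through` are hypothetical, and the result depends on this copy's transcription of positions 131-430 being accurate.

```python
# Standard genetic code, with bases ordered T, C, A, G at each codon position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}

ATG_POSITION = 131  # 1-based position of the initiator ATG in Figure 2

# Unspliced genomic sequence, positions 131-430 of Figure 2.
GENOMIC_FROM_ATG = (
    "ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGTCTCTTGCCTGTGGGGAAAGGTGAACTCCGATGAAG"
    "TTGGTGGTGAGGCCCTGGGCAGGTTGGTATCCAGGTTACAAGGCAGCTCACAAGAAGAAGTTGGGTGCTT"
    "GGAGACAGAGGTCTGCTTTCCAGCAGACAC"
    "TAACTTTCAGTGTCCCCTGTCTATGTTTCCCTTTTTAGGCTGCTGGTTGTCTACCCTTGGACCCAGCGGT"
    "ACTTTGATAGCTTTGGAGACCTATCCTCTG"
    "CCTCTGCTATCATGGGTAATGCCAAAGTGA"
)


def read_through(dna: str, start_position: int):
    """Return (codons read before the stop, 1-based position of the stop codon).

    The count includes the initiator Met. Returns None if no in-frame
    termination codon is found.
    """
    for i in range(0, len(dna) - 2, 3):
        if CODON_TABLE[dna[i:i + 3]] == "*":
            return i // 3, start_position + i
    return None


codons, stop_position = read_through(GENOMIC_FROM_ATG, ATG_POSITION)
print(f"{codons} codons read, stop codon at {stop_position}-{stop_position + 2}")
```

The count of 99 includes the initiator Met, and the terminator reached is the TGA at positions 428-430, inside the second coding block but out of phase with it.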

Figure 5. Comparison of the Regions Surrounding the Cap and Potential Cap Sites of Several Genes

[The aligned sequences of the original figure (mouse βmaj, mouse βmin, adenovirus, rabbit β-globin and mouse λ light chain) are not recoverable from this copy.]

The boxed sequences contain homologies indicated by vertical lines. CAP ↓ denotes the nucleotide corresponding to the penultimate nucleotide of the capped mRNA, except for the λ light chain, where it is not yet known. The numbers in parentheses refer to the distance in nucleotides from the initiating codon of the mRNA. The sources of genomic and mRNA sequences are as follows: mouse βmaj and βmin genomic (this paper and D.A.K., unpublished data), βmaj and βmin mRNA (Baralle and Brownlee, 1978), adenovirus late genes and mRNA (Ziff and Evans, 1978), rabbit β-globin genomic and mRNA (Efstratiadis et al., 1977; A. Efstratiadis, E. Lacy and T. Maniatis, personal communication), mouse λ light chain (Tonegawa et al., 1978).

The 3’ Untranslated Region and Presumptive Poly(A) Addition Site

Since the sequence of the 3’ untranslated segment of mouse β-globin mRNA has not yet been determined, the location of the poly(A) addition site in the mouse gene must be inferred by comparison to the corresponding sequence for the rabbit (Efstratiadis et al., 1977). There is substantial homology between these sequences (Figure 6). In particular, there is 72% homology to the last 29 bases of the rabbit mRNA, including the pentanucleotide AATAA found near the end of the 3’ untranslated region in all eucaryotic mRNAs examined thus far (Proudfoot and Brownlee, 1976). Homology to the human mRNA is 83% in this region (Marotta et al., 1977). In all three genes this pentanucleotide occurs 18 bases before the triplet TGC, which represents the poly(A) addition site in the rabbit and human mRNAs. We suggest that this triplet will prove to fill the same role in the mouse gene. The 3’ untranslated region of the mouse gene is 133 bases long, while the corresponding rabbit region is 95 bases long. The two sequences show substantial homology throughout this region, the bulk of the extra bases in the mouse gene occurring in two blocks of 10 and 28 bases. Excluding these, overall homology is 59%.

Figure 6. Comparison of the 3’ Terminal Portion of the Mouse β-Globin Major Gene to the 3’ Poly(A) Terminal Portion of Rabbit β-Globin mRNA

[The base-by-base alignment of the original figure is not recoverable from this copy.]

The two sequences were aligned to maximize homology with the aid of the computer program of Korn et al. (1977). Vertical lines indicate corresponding bases. Dashes indicate “deletions” (relative to the corresponding sequence) required for maximum homology. Long insertions in the mouse gene are denoted only by the number of bases contained. Their sequences are indicated in Figure 2. The ubiquitous 3’ terminal pentanucleotide AATAA is overscored. The sequence shown represents bases 1328-1466 of the mouse gene (Figure 2) and 433-533 of the rabbit mRNA (Efstratiadis et al., 1977).

Some Tentative Conclusions

The sequence of the β-globin major gene represents the first complete sequence determined for a relatively unique chromosomal gene derived from a eucaryotic organism. As such it provides a basis for comparison with other chromosomal genes whose sequences will doubtless become available in the near future. Even with the limited sequence information available, several tentative conclusions can be drawn regarding the functional and evolutionary significance of various segments of the gene.

As we have pointed out, there is a 12 base “capping box” which just precedes and overlaps the initial nucleotide sequence of β-globin mRNA and which is homologous to an analogous region identified with the late adenovirus promoter (Figure 5). This region may well prove to be a transcriptional initiation site, a possibility that is open to experimental test. Furthermore, the sequence reveals that the two intervening sequences that interrupt the gene differ in both length and sequence, but retain close homology at their coding:intervening sequence borders and share elements of this homology with similar regions of other vertebrate genes (Table 1). The fact that intervening sequences occur at identical positions in mouse and rabbit β-globin genes and in at least one of the interrupted positions of the mouse α-globin gene suggests that they have a critical role in the expression of these genes, and that they are eliminated from mRNA precursors by mechanisms which have been preserved through hundreds of millions of years of evolution.

A region of homology with the 3’ poly(A) terminus of rabbit globin mRNA putatively identifies the 3’ poly(A) addition site of mouse β-globin mRNA (Figure 6). Again the preserved sequences suggest a required role, possibly for poly(A) addition or transcriptional termination or both. The preserved regions that precede, interrupt and follow the coding sequences obviously represent additional targets for mutation, and further globin gene comparisons might provide a molecular explanation for a variety of inherited anemias (Leder, 1978). In addition, the tentative identification of functional regions and their availability as substrates for both in vitro and in vivo experiments raises the possibility of developing direct functional tests for each of these important regions.
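Two of the figures quoted in the discussion of the 3’ untranslated region, the spacing between AATAA and the TGC triplet and the 72% homology over the last 29 bases, can be checked on the terminal segments that survive from Figure 6. The Python sketch below is illustrative. It assumes that the two segments are transcribed correctly and contain no alignment gaps, that the spacing is measured from the last base of AATAA to the first base of TGC, and that homology means ungapped positional identity. The function names are hypothetical.

```python
# Terminal segments as they appear in the Figure 6 extraction (assumed gap-free).
MOUSE_GENE_3P = "TCTTCTGACAAATAAAAAGCATTTATGTTCACTGC"   # ends at the presumptive site
RABBIT_MRNA_3P = "GGCUAAUAAAGGAAAUUUAUUUUCAUUGC"        # last 29 bases before poly(A)


def to_dna(seq: str) -> str:
    """Write an RNA or DNA sequence in the DNA alphabet."""
    return seq.upper().replace("U", "T")


def signal_to_site_offset(seq: str, signal: str = "AATAA", site: str = "TGC"):
    """Offset from the last base of `signal` to the first base of the next `site`.

    Returns None if either motif is missing.
    """
    dna = to_dna(seq)
    start = dna.find(signal)
    if start < 0:
        return None
    last = start + len(signal) - 1
    site_start = dna.find(site, last + 1)
    return None if site_start < 0 else site_start - last


def count_matches(a: str, b: str) -> int:
    """Number of identical positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x == y for x, y in zip(a, b))


rabbit = to_dna(RABBIT_MRNA_3P)
mouse_tail = to_dna(MOUSE_GENE_3P)[-len(rabbit):]
matches = count_matches(mouse_tail, rabbit)

print("mouse AATAA to TGC offset:", signal_to_site_offset(MOUSE_GENE_3P))
print("rabbit AAUAA to UGC offset:", signal_to_site_offset(RABBIT_MRNA_3P))
print(f"identity over last {len(rabbit)} bases: {matches}/{len(rabbit)}")
```

With these definitions both sequences give an offset of 18, and the mouse and rabbit segments agree at 21 of 29 positions (about 72%).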

Experimental Procedures

Materials

γ-32P-ATP (spec. act. ~3000 Ci/mmole) was purchased from Amersham (Chicago, Illinois). Restriction endonucleases, T4 DNA ligase and polynucleotide kinase were from New England Biolabs (Beverly, Massachusetts), except Sau 3A and Ava II, which were from Bethesda Research Laboratories (Maryland). Bacterial alkaline phosphatase (BAPF) was from Worthington Biochemicals, dialyzed exhaustively against 100 mM Tris-HCl (pH 8.0) to remove ammonium sulfate. pMB9.βG2 and pBR322.βG3 were grown under P3 conditions in χ1776 (Curtiss et al., 1977) in Difco Brain Heart Infusion supplemented with 50 μg/ml Sigma thymidine and diaminopimelic acid, with 10 μg/ml tetracycline or ampicillin, respectively. DNA was purified according to the procedure of Clewell and Helinski (1969) using the NaOH precipitation of linear DNA and chromatography on Sepharose 2B.

Restriction Endonuclease Digestion

DNA was digested as described by Tilghman et al. (1978a). Conditions for Bam HI were used for Alu I, Hae III, Hinf I and Ava II. Sau 3A digestions were in the same buffer but at 30°C. Digestions with Mbo II were in the same buffer but without NaCl.

Preparative Polyacrylamide Gel Electrophoresis

DNA was electrophoresed in 15 x 30 cm x 1.5 mm slab gels of 5-12% acrylamide (1/30 as bisacrylamide) in 50 mM Tris-borate (pH 8.3), 2.5 mM EDTA (1 x TBE). Samples to be eluted were cut out, put in 1 x TBE in dialysis bags and electrophoresed for 4-16 hr at 150 mA. The eluted DNA was purified over DEAE-cellulose (<3 kb fragments) or HAP (>3 kb fragments) and then ethanol-precipitated.

5’-Terminal Labeling of Restriction Fragments

Fragments were treated with BAP in 100 mM Tris-HCl (pH 8) for 5 min at 10°C and 55 min at 37°C, then extracted with phenol:chloroform:isoamyl alcohol (1:1:0.04), chloroform and ether, and ethanol-precipitated. End-labeling was as described by Maxam and Gilbert (1977), but at pH 8.5 without spermidine.

DNA Sequencing

Sequencing was performed according to the method of Maxam and Gilbert (1977) using 8 and 24 hr 20% urea-acrylamide gels and 10 hr 10% gels. In some cases, 0.3 mm thin gels were used according to the procedure of Sanger and Coulson (1978) for increased resolution. In these cases, the alternate A and G reactions were used, and samples were loaded in 90% formamide (deionized) rather than urea-NaOH (which caused smearing). Thin gels also gave poor resolution if the gel or gel solution sat at room temperature for more than 24 hr before use.

Acknowledgments

We are most grateful to Drs. J. G. Seidman, M. Cashel and A. Maxam for patiently instructing us in the sequencing work. We also wish to thank Dr. J. V. Maizel, Jr. for implementing the computer analysis program and for instructing us in its use. We are most grateful to Ms. Terri Broderick for her expert assistance in the preparation of the manuscript. We would also like to thank Drs. R. Evans, E. Ziff, T. Maniatis, A. Efstratiadis and E. Lacy for making their results available to us prior to publication. All experiments were carried out in accordance with the NIH Guidelines on Recombinant DNA Research.

The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked “advertisement” in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.

Received September 11, 1978

References

Aloni, Y., Dhar, R., Laub, O., Horowitz, M. and Khoury, G. (1977). Proc. Nat. Acad. Sci. USA 74, 3686-3690.

Baralle, F. E. and Brownlee, G. G. (1978). Nature 274, 84-87.

Berget, S. M., Moore, C. and Sharp, P. A. (1977). Proc. Nat. Acad. Sci. USA 74, 3171-3175.

Bolivar, F., Rodriguez, R. L., Betlach, M. C. and Boyer, H. W. (1977a). Gene 2, 75-93.

Bolivar, F., Rodriguez, R. L., Greene, P. J., Betlach, M. C., Heyneker, H. L., Boyer, H. W., Crosa, J. H. and Falkow, S. (1977b). Gene 2, 95-113.

Brack, C. and Tonegawa, S. (1977). Proc. Nat. Acad. Sci. USA 74, 5652-5656.

Breathnach, R., Mandel, J. L. and Chambon, P. (1977). Nature 270, 314-319.

Breathnach, R., Benoist, C., O’Hare, K., Gannon, F. and Chambon, P. (1978). Proc. Nat. Acad. Sci. USA 75, 4853-4857.

Catterall, J. F., O’Malley, B. W., Robertson, M. A., Staden, R., Tanaka, Y. and Brownlee, G. G. (1978). Nature 275, 510-513.

Chow, L. T., Gelinas, R. E., Broker, T. R. and Roberts, R. J. (1977). Cell 12, 1-8.

Clewell, D. and Helinski, D. (1969). Proc. Nat. Acad. Sci. USA 62, 1159-1166.

Curtis, P. J., Mantei, N. and Weissmann, C. (1977). Cold Spring Harbor Symp. Quant. Biol. 42, 971-984.

Curtis, P. J. and Weissmann, C. (1976). J. Mol. Biol. 106, 1061-1075.

Curtiss, R., III, Inoue, M., Pereira, D., Hsu, J. C., Alexander, L. and Rock, L. (1977). Miami Winter Symp. 13, 99-111.

Doel, M. T., Houghton, M., Cook, E. A. and Carey, N. H. (1977). Nucl. Acids Res. 4, 3701-3713.

Dugaiczyk, A., Woo, S. L. C., Lai, E. C., Mace, M. L., Jr., McReynolds, L. and O’Malley, B. W. (1978). Nature 274, 328-333.

Efstratiadis, A., Kafatos, F. C. and Maniatis, T. (1977). Cell 10, 571-585.

Gilman, J. G. (1972). Science 178, 873-874.

Glover, D. M. and Hogness, D. S. (1977). Cell 10, 167-176.

Goodman, H. M., Olson, M. V. and Hall, B. (1977). Proc. Nat. Acad. Sci. USA 74, 5453-5457.

Haynes, J. R., Kalb, V. F., Jr., Rosteck, P., Jr. and Lingrel, J. B. (1978). FEBS Letters, in press.

Jeffreys, A. J. and Flavell, R. A. (1977). Cell 12, 1097-1108.

Kafatos, F. C., Efstratiadis, A., Forget, B. G. and Weissman, S. M. (1977). Proc. Nat. Acad. Sci. USA 74, 5618-5622.

Kinniburgh, A. J., Mertz, J. E. and Ross, J. (1978). Cell 14, 681-693.

Kitchingman, G. R., Lai, S.-P. and Westphal, H. (1977). Proc. Nat. Acad. Sci. USA 74, 4392-4395.

Knapp, G., Beckmann, J. S., Johnson, P. F., Fuhrman, S. A. and Abelson, J. (1978). Cell 14, 221-236.

Korn, L. J., Queen, C. L. and Wegman, M. N. (1977). Proc. Nat. Acad. Sci. USA 74, 4401-4405.

Kwan, S.-P., Wood, T. G. and Lingrel, J. B. (1977). Proc. Nat. Acad. Sci. USA 74, 178-182.

Lai, E. C., Woo, S. L. C., Dugaiczyk, A., Catterall, J. F. and O’Malley, B. W. (1978). Proc. Nat. Acad. Sci. USA 75, 2205-2209.

Leder, A., Miller, H., Hamer, D., Seidman, J. G., Norman, B., Sullivan, M. and Leder, P. (1978). Proc. Nat. Acad. Sci. USA, in press.

Leder, P. (1978). New Eng. J. Med. 298, 1079-1080.

Marotta, C. A., Wilson, J. T., Forget, B. G. and Weissman, S. M. (1977). J. Biol. Chem. 252, 5040-5053.

Maxam, A. and Gilbert, W. (1977). Proc. Nat. Acad. Sci. USA 74, 560-564.

Miller, H. I., Konkel, D. A. and Leder, P. (1978). Nature 275, 772-774.

O’Farrell, P. Z., Cordell, B., Valenzuela, P., Rutter, W. J. and Goodman, H. M. (1978). Nature 274, 438-445.

Popp, R. A. and Bailiff, E. G. (1973). Biochim. Biophys. Acta 303, 61-67.

Proudfoot, N. J. and Brownlee, G. G. (1976). Nature 263, 211-214.

Ross, J. (1976). J. Mol. Biol. 106, 403-420.

Sanger, F. and Coulson, A. R. (1978). FEBS Letters 87, 107-110.

Smith, H. O. and Birnstiel, M. L. (1976). Nucl. Acids Res. 3, 2387-2398.

Smith, K. and Lingrel, J. B. (1978). Nucl. Acids Res. 5, 3295-3301.

Tiemeier, D. C., Tilghman, S. M., Polsky, F. I., Seidman, J. G., Leder, A., Edgell, M. H. and Leder, P. (1978). Cell 14, 237-245.

Tilghman, S. M., Tiemeier, D. C., Polsky, F. I., Edgell, M. H., Seidman, J. G., Leder, A., Enquist, L. W., Norman, B. and Leder, P. (1977). Proc. Nat. Acad. Sci. USA 74, 4406-4410.

Tilghman, S. M., Tiemeier, D. C., Seidman, J. G., Peterlin, B. M., Sullivan, M., Maizel, J. V., Jr. and Leder, P. (1978a). Proc. Nat. Acad. Sci. USA 75, 725-729.

Tilghman, S. M., Curtis, P. J., Tiemeier, D. C., Leder, P. and Weissmann, C. (1978b). Proc. Nat. Acad. Sci. USA 75, 1309-1313.

Tonegawa, S., Maxam, A. M., Tizard, R., Bernard, O. and Gilbert, W. (1978). Proc. Nat. Acad. Sci. USA 75, 1485-1489.

Valenzuela, P., Venegas, A., Weinberg, F., Bishop, R. and Rutter, W. J. (1978). Proc. Nat. Acad. Sci. USA 75, 190-194.

van den Berg, J., van Ooyen, A., Mantei, N., Schamböck, A., Grosveld, G., Flavell, R. A. and Weissmann, C. (1978). Nature 276, 37-44.

Weinstock, R., Sweet, R., Weiss, M., Cedar, H. and Axel, R. (1978). Proc. Nat. Acad. Sci. USA 75, 1299-1303.

Wellauer, P. K. and Dawid, I. B. (1977). Cell 10, 193-212.

White, R. L. and Hogness, D. S. (1977). Cell 10, 177-192.

Ziff, E. B. and Evans, R. M. (1978). Cell 15, 1463-1475.

Note Added in Proof

D. Hogness and M. Goldberg (personal communication) have noted a canonical sequence, TATAAATA, that occurs ~80 nucleotides to the 5’ side of several histone genes in Drosophila. An analogous sequence, TATAAGGT, begins 81 nucleotides to the 5’ side of the mouse β-globin major coding sequence (Figure 2).
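The degree of similarity between the two octamers quoted in this note can be checked by a simple position-by-position comparison. The following is a minimal illustrative sketch, not the computer analysis program used in this work; the function name `count_identities` and the variable names are hypothetical, and only the two sequences themselves come from the text.

```python
def count_identities(a: str, b: str) -> int:
    """Count positions at which two equal-length sequences carry the same base."""
    if len(a) != len(b):
        raise ValueError("sequences must be the same length for an ungapped comparison")
    return sum(x == y for x, y in zip(a, b))


# Octamers quoted in the note (the only values taken from the text).
drosophila_histone = "TATAAATA"  # canonical sequence 5' to several Drosophila histone genes
mouse_beta_globin = "TATAAGGT"   # analogous sequence 5' to the mouse beta-globin major gene

matches = count_identities(drosophila_histone, mouse_beta_globin)
print(f"{matches}/{len(drosophila_histone)} positions identical")
```

The comparison is ungapped and unweighted, so it shows only that the two sequences share their first five bases (TATAA) and differ at the last three positions.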