Biochimica et Biophysica Acta 910 (1987) 261-270 Elsevier
BBA91762
(A)GGG(A), (A)CCC(A) and other potential 3' splice signals in primate nuclear pre-mRNA sequences
Ruth Nussinov
Sackler Institute of Molecular Medicine, Sackler Faculty of Medicine, Tel Aviv University, Ramat Aviv (Israel) and Laboratory of Mathematical Biology, National Cancer Institute, NIH, Bethesda, MD (U.S.A.)
(Received 8 May 1987)
Key words: Splice signal; mRNA; Computer model; Lariat structure; Consensus sequence; (Primate DNA)
Several 3' splice signals in nuclear precursor mRNAs have been known for some time: the AG doublet on the left-hand side of the splice and a run of pyrimidines just upstream of it. More recently it has been noted that the YNYTRAY sequence (where Y is a pyrimidine, R a purine and N any base) is a branching sequence participating in formation of a lariat structure. Keller and Noon have shown the existence of several putative consensus sequences at this site. In this work, extensive computations of the distributions of the 256 quartets in all primate nuclear pre-mRNA intron sequences present in GenBank have been carried out. Several putative signals upstream and downstream of the 3' splice have been detected. These have been compared with the results obtained in analogous computations carried out on all nuclear pre-mRNA introns present in a combined eukaryotic file containing mammal, non-mammalian vertebrate, invertebrate and plant sequences. The distributions of the more interesting oligomers are shown here. Of particular interest are the putative (A)GGG(A) signal 60 nucleotides upstream of the 3' splice site and (A)CCC(A) 3-40 nucleotides downstream of it. A possible splicing model explaining these data and involving formation of alternative hairpin loop structures is proposed.
Correspondence: R. Nussinov, Laboratory of Mathematical Biology, Bldg. 10, Room 4B-56, National Cancer Institute, NIH, Bethesda, MD 20892, U.S.A.
Introduction
Two conserved sequence elements are present in the 3' splice sites, the pyrimidine-rich tract and the AG dinucleotide [1,2]. More recently it has been shown that intron splicing involves lariat formation, in which the 5' end of the intron is covalently linked to an adenine residue 18 to 40 nucleotides upstream of the 3' splice site [3,4]. Lariat formation occurs at a specific site within the human β-globin intervening sequence 1 [5], the rabbit β-globin intervening sequence 2 [6] and the adenovirus major late intervening sequence 1 [4]. By aligning the sequences near the 3' ends of the introns in a number of mammalian globin genes with the branch-point sequence of human β-globin intervening sequence 1, a putative branch-point consensus sequence YNYTRAY (where Y is a pyrimidine, R a purine and N any base) has been derived [5,6]. Branching apparently occurs at the A residue within this consensus sequence. The branch-point consensus sequence shares sequence homology with the highly conserved TACTAAC box, the site of lariat formation in yeast introns [7,8]. In the yeast, as in the mammalian branch-point consensus sequence, branching occurs at the A residue at position 6 in the TACTAAC sequence. The presence of a putative consensus branch-point sequence upstream of the 3' splice junction
0167-4781/87/$03.50 © 1987 Elsevier Science Publishers B.V. (Biomedical Division)
suggests that this sequence plays a role in precursor mRNA splicing. However, unlike the yeast TACTAAC sequence, the putative branch-point sequences in higher eukaryotes are highly variable [3]. Nevertheless, the mechanism of precursor mRNA splicing appears to be highly conserved throughout evolution. Reed and Maniatis [3] have shown that Drosophila ftz and actin precursor mRNAs are both spliced efficiently in HeLa cell nuclear extracts. Moreover, in order to determine whether specific intron sequences are important for splicing, and whether these sequences correspond to the previously proposed branch-point consensus sequence, Reed and Maniatis [3] have also analyzed the sites of lariat formation within introns from a number of different genes. Among the pre-mRNAs examined were several mammalian RNA precursors, a Drosophila RNA precursor and a precursor containing a synthetic intron. In each case they have found that a lariat is formed and they have identified the branch-point sequence. That sequence corresponds to the YNYTRAY consensus sequence noted above. Since such a consensus sequence is quite ill-defined, it is unclear how it is recognized and used by the enzymes. It can hydrogen-bond with a variety of nucleotide sequences or stay single-stranded, displaying in turn a variety of RNA geometries. So far, probably the only known fixed characteristic of this sequence is the presence of an adenine residue at a specific position. Since the YNYTRAY branching sequence can assume different structures, how is a specific A, among other A's in its vicinity, recognized? Moreover, the A utilized in the branching reaction is not necessarily part of the upstream YNYTRAY sequence closest to the 3' splice junction [3]. One possible solution to this intriguing problem is that there exists some consensus secondary (and tertiary) structure for eukaryotic introns with the lariat branching sequence and the 3' splice site occupying unique sites within it.
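To make the looseness of the YNYTRAY consensus concrete, the following minimal sketch scans an intron for every heptamer that fits it and reports where the candidate branch-point A lies relative to the 3' splice junction. It is an illustration only, not a program used in this work: the function names, the toy intron and the convention that the last intron base is position -1 are all assumptions made for the example.

```python
import re

# Degenerate symbols as defined in the text: Y = pyrimidine, R = purine, N = any base.
# T is used instead of U, following the GenBank convention adopted in the paper.
DEGENERATE = {"Y": "[CT]", "R": "[AG]", "N": "[ACGT]"}


def consensus_to_regex(consensus):
    """Translate a consensus such as 'YNYTRAY' into a regular expression string."""
    return "".join(DEGENERATE.get(symbol, symbol) for symbol in consensus)


def find_branch_candidates(intron, consensus="YNYTRAY", branch_index=5):
    """List (offset, heptamer) for every match of the consensus in the intron.

    offset is the position of the candidate branch-point A (the sixth base of
    the match), counted so that the last intron base is -1 (an assumed
    convention). Overlapping matches are reported through a lookahead.
    """
    pattern = re.compile("(?=(" + consensus_to_regex(consensus) + "))")
    candidates = []
    for match in pattern.finditer(intron):
        a_index = match.start() + branch_index
        candidates.append((a_index - len(intron), match.group(1)))
    return candidates


# Hypothetical toy intron: GT at the 5' end, a T-rich tract and AG at the 3' end.
toy_intron = "GTAAGTCCTCTGAC" + "T" * 15 + "CTTTCAG"
print(find_branch_candidates(toy_intron))
```

For the toy intron the only match is CTCTGAC, with its A 24 nucleotides upstream of the 3' end. In real introns several matches are usually found, which is precisely the recognition problem raised above.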
Intron secondary structure computations have been carried out. So far the results for nuclear pre-mRNAs are not encouraging. Possibly this is due to the inherent shortcomings of RNA secondary structure calculations. Tertiary structure computations are even more difficult. These are technical difficulties
which may, however, hinder the solution. Alternatively, while elaborate secondary structures are important in self-splicing mitochondrial introns, they may be inessential in splicing of nuclear precursor mRNAs [15,16]. Here I try to circumvent these shortcomings by approaching the problem from the simpler, linear sequence standpoint. All of the signals noted so far at and upstream of the 3' splice site occur elsewhere in the genome as well: the AG doublet, a run of several pyrimidines sometimes interspersed with purines and the YNYTRAY consensus. The fact that sharp signals are picked up in sequence alignments results from their frequent recurrence in many introns at the specified locations. Are there additional, frequently recurring oligomers in the vicinity of the 3' splice sites? If such sequences exist, they could aid in protein or nucleoprotein branch-point and splice site recognition. Here I list additional putative 3' splice signals. I also suggest that some of these newly noted potential signals may form two alternative stem-loop structures. Their possible relevance to branch-point recognition and 3' site splicing is noted.
Methods
This work uses two computer programs. The first is an improved version of an interactive program described previously [10]. Briefly, the program selects any group or subgroup of sequences from the database. These sequences are next aligned according to any specified biological marker, in this case the 3' junctions of introns and their adjacent exons. These sites are assigned position 0. A span is next specified, -500 to +500 for the primate file. A query oligomer is defined. The computer scans all aligned sequences and notes the number of occurrences of the query oligomer in each position within the noted span. The base composition in each position in all aligned sequences is noted as well. The output can be printed or used in a graphic representation. For the latter, a window is chosen (15 nucleotides here) and shifted (by one nucleotide at a time in this work) within the specified span. We sum the total number of occurrences of the query oligomer and, separately, the total number of bases. The ratio of the first sum to the second (the Y-axis in our figures) is graphed as a function of the position in the aligned sequences (X-axis).
The second program uses the raw output from the first program. It combines specified files (e.g., primates and rodents and invertebrates, etc.) in which occurrences of the same defined query oligomer are noted. It computes the sums of the occurrences of this oligomer in the various groups as functions of the nucleotide positions in the aligned sequences.
Here I have analyzed the distributions of the 256 quartets in all 211 primate sequences with 356 3' splice sites present in the July, 1986 issue of GenBank. Since the splicing mechanism appears to be well conserved in eukaryotic sequences, this same analysis was repeated for the combined primate, rodent, other mammal, non-mammalian vertebrate and plant (yeast) introns present in the database with over 900 3' splice sites. The results obtained for the primates are compared with those of the extended eukaryotic file.
Keller and Noon [12] have already carried out computer searches near the 3' splice sites. Their procedure differs, however, from the one used here. They have looked for a consensus sequence resembling the yeast TACTAAC box, known to be involved in splicing. My searches, on the other hand, are not restricted to a predefined sequence. The distributions of all quartets have been examined. Only the oligomer length is specified. Triplets are too short, and I would have been inundated with data had I analyzed the distributions of over a thousand pentamers for each file. Also, from the statistical standpoint, only a few occurrences of sequence-specific pentamers at any given position are expected. Hence I give the results for the primates. Results obtained for the more extended eukaryotic file analysis are cited for comparison with the newly noted putative primate 3' signals.
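The counting procedure described above can be sketched in a few lines. This is not the program used in this work, only a minimal reconstruction from the description, with several assumptions: each sequence is supplied together with the index of the base taken as position 0 (here the first exon base after the 3' junction), an occurrence is recorded at the position of its first base, and all function names are hypothetical.

```python
from collections import Counter
from itertools import product


def quartets():
    """All 4**4 = 256 quartets over the DNA alphabet."""
    return ["".join(bases) for bases in product("ACGT", repeat=4)]


def count_by_position(sequences, query, span):
    """First program, counting step.

    sequences: list of (sequence, site) pairs, where site is the 0-based index
    of the base assigned position 0 (an assumed convention).
    Returns two Counters keyed by aligned position: occurrences of the query
    starting at that position, and the number of bases present there.
    """
    hits, bases = Counter(), Counter()
    for seq, site in sequences:
        for i in range(len(seq)):
            pos = i - site
            if -span <= pos <= span:
                bases[pos] += 1
                if seq.startswith(query, i):
                    hits[pos] += 1
    return hits, bases


def windowed_ratio(hits, bases, span, window=15):
    """First program, graphing step: slide a window one nucleotide at a time
    and return {window start: summed occurrences / summed bases}."""
    profile = {}
    for start in range(-span, span - window + 2):
        positions = range(start, start + window)
        n_bases = sum(bases[p] for p in positions)
        if n_bases:  # skip windows not covered by any sequence
            profile[start] = sum(hits[p] for p in positions) / n_bases
    return profile


def combine(counters):
    """Second program: sum per-position counts obtained for several files."""
    total = Counter()
    for counter in counters:
        total.update(counter)
    return total
```

Running `count_by_position` once per entry of `quartets()` and per file, then passing the per-file counts through `combine`, reproduces the overall flow described in the text under these assumptions.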
Details of the extended eukaryotic analysis, with the respective graphs, will be presented elsewhere.
Results
Figures 1-8 present some of the putative signals upstream and downstream of the 3' splice sites in primate nuclear precursor mRNA sequences. Fig. 1 depicts some of the distributions of the AG-containing quartets possessing sharp peaks at position 0, i.e. at the 3' splice junction.
Fig. 1. The distributions of AG-containing quartets, (a) CAGG, (b) AGGT and (c) CAGA, in 356 primate precursor mRNA introns. 0 is the 3' splice site. Negative values on the X-axis refer to positions upstream of the 3' splice site. Positive values refer to positions downstream of it. The Y-axis indicates the fraction of sequences containing the quartet at the corresponding site. For further details see Methods.
Fig. 2 displays some of the striking occurrences of pyrimidine-containing quartets. Sharp peaks are obtained just prior to position 0, with a steep decline at that site, by TTTT, TTCT, TCTC, TTTC, C4, CTTC, CTCC, TCCC, CCTC, TTCC, TCCT and CTTT. Among these, T4 is the most striking. The distributions of the Y3R1 quartets, ATTT, TGTT, GTTT, TTAT, TTGT, TGTC and TATT also display sharp peaks at this site. These signals probably result from the high frequency of Y-runs at this location. Nevertheless, it is of interest that the frequent quartets are those which contain several T's. We shall come back to this point later.
Fig. 2. The distribution of Y4 quartets, (a) TTTT, (b) TTCT and (c) TCTC around primate 3' intron splice sites. For further details see caption to Fig. 1.
Fig. 3 displays the distribution of the consensus sequence noted above, YTRAY.
Fig. 3. The distribution of the YTRAY consensus sequence around primate 3' intron splice sites. For further details see caption to Fig. 1.
Keller and Noon [12] have detected the consensus sequence (C/A/T)TAAT in sea-urchin genes, CTGAC in mouse, rat and human globin genes, CTAAT in Drosophila and CT(G/A)A(C/T) in chicken and duck intron sequences. Fig. 4a-e shows signals related to the Keller and Noon consensus. Fig. 4a-c displays the distributions of CTGA, TGAC and TGAT. The latter has been noted by Keller and Noon as occurring in three human globin genes. Fig. 4d shows the distribution of TAAT, noted by Keller and Noon as a Drosophila, a chicken and duck consensus sequence. CTGT (Fig. 4e) is a variant of the chicken and duck CTGA consensus.
Fig. 4. The distributions of Keller and Noon [12] consensus sequences, (a) CTGA, (b) TGAC, (c) TGAT, (d) TAAT and (e) CTGT around primate 3' intron splice sites. For further details see caption to Fig. 1.
In Fig. 5 two putative signals are depicted. GGGA and AGGG peak at positions -62 and -63, respectively, suggesting an AGGGA consensus 60 nucleotides upstream of the 3' splice site. Comparison with the results obtained for the larger eukaryotic file shows that both of these quartets are very frequent at this -60 site. As in primates, the GGGA peak is, however, more striking.
Fig. 5. The distribution of (a) GGGA and (b) AGGG around primate 3' intron splice sites. For further details see caption to Fig. 1.
Fig. 6a-c shows the distributions of the related quartets CCCA, ACCC and CCAT. The CCCA peaks extend from +3 to +44, with the highest points at 10 and 25 nucleotides downstream of the 3' splice site. In the fuller eukaryotic file analysis the peaks obtained at these sites are more striking. Fig. 6d shows the distribution of CCCA for the combined eukaryotes file. The concentration of this quartet in the region following the 3' splice is very high at the +3 to +44 span. CCCG and CGCC (not shown) also yield clear peaks in this region in both the primates and the extended eukaryotes analyses.
Fig. 6. The distribution of (a) CCCA, (b) ACCC and (c) CCAT around primate 3' intron splice sites. (d) Shows the distribution of CCCA at the same position in the larger eukaryotic file, containing over 900 introns. For further details see Methods and the caption to Fig. 1.
Fig. 7 shows some additional putative signals noted in the analysis. In the combined eukaryotes file, TGGC peaks at +100, GACC shows a significant peak at +20, CATC peaks at +12, GATG at +20, GTGA at +8, GTCT at -14 and +20, ATCC at +14, ATGT at +7, GATT at -40 and AAAT at -70. ATTC does not show a significant peak, and GGTA peaks at -85.
Fig. 7. The distributions of (a) TGGC, (b) GACC, (c) CATC, (d) GATG, (e) GTGA, (f) GTCT, around primate 3' intron splice sites. For further details see caption to Fig. 1.
In our analysis we have taken the sequences listed in the appropriate GenBank files. The files are not edited to remove related sequences. Although to some extent this may result in over- or underrepresentation of specific features, this would probably not affect significantly the nature of the results presented here. The number of the analyzed sequences is large enough to override the potential effects of including related sequences in the calculations. Throughout this paper, although discussing precursor mRNAs, all sequences are quoted as having 'T' rather than 'U'. This is done for convenience, since the GenBank sequences are used. All calculations are carried out on the DNA strand which has the same sequence as the precursor mRNA transcript. This paper attempts to survey all 3' splice signals noted to date. This inevitably results in numerous graphs. The new potential signals are presented in Figs. 5-7.
Discussion
The extensive computations presented here imply the following known as well as new features:
(i) The AG doublet has long been known to exist at the 3' splice boundary [14]. It is thus not surprising that AG-containing quartets show sharp peaks at this site. We find that not all such quartets are equally frequent at the 3' splice site. Fig. 1 demonstrates those quartets which are most frequent.
(ii) The peaks presented by the Y4 quartets (Fig. 2) are not surprising, either. Pyrimidine runs just upstream of the 3' splice site have been known for some time, too [14]. Perhaps the unexpected result is that the most frequent Y4 at this site is T4. Next are those which also contain several T's. Just upstream of the 3' splice site some Y3R quartets show peaks. Here, too, the quartets which are most frequent contain three T's.
(iii) The peak in the distribution of the YTRAY consensus is seen very clearly [3].
(iv) Strong peaks are observed in the distributions of the Keller and Noon consensus sequences. Fig. 4 also demonstrates the background behavior. For example, although TGAC (Fig. 4b) peaks at -20, it peaks following the 3' splice site as well. Indeed, it is interesting that some of these quartets occur more frequently downstream of the 3' splice site than upstream.
(v) Interesting new findings are shown in Figs. 5 and 6. AGGG and GGGA are frequent at 60 nucleotides upstream of the 3' splice site, yielding very sharp peaks (Fig. 5). CCCA, on the other hand, is very frequent following the 3' splice site. Additional putative signals are given in Fig. 7.
The structure of the nuclear precursor mRNA is unknown. Nevertheless, it is of interest that (A)GGG(A) and the partially complementary (A)CCC(A) yield putative signals upstream and downstream of the branching point and 3' splice sites. Also, while (A)GGG(A) demonstrates a narrow, well-defined peak with a clear boundary, (A)CCC(A) yields a wider, thicker, less well-defined peak. In the combined eukaryotic file, AGGG also yields a wide peak, whereas the GGGA peak is, like the one shown here, sharp and with well-defined boundaries.
A possible model is presented below (Fig. 8). It is perhaps not inconceivable that the AGGGA 60 nucleotides upstream of the 3' splice site (Fig. 8a) base-pairs with the Y-run 10-20 nucleotides upstream of this 3' site. In such a structure the YNYTRAY branching sequence, at approx. 25 nucleotides upstream of the splice position, is present in the loop, sticking out (Fig. 8b). This may conform with the point noted in the Introduction, namely, that it is unlikely that the nonspecific YNYTRAY sequence forms a stem structure. On the other hand, having such a sequence in a loop might enable its recognition and facilitate its protein-mediated branching reaction. In this regard it is of interest to note that C4- and C3-containing quartets are less frequent among the Y3 and Y4 oligomers (point (ii) above, Fig. 2). This may suggest that this stem is not too stable and consists of AU, GC and GU base-pairs. The small nuclear ribonucleoprotein may hydrogen-bond with the nucleotides upstream of the 3' splice site [17]. During, or following, the lariat formation, the secondary structure may change. The small loop opens up and a larger one recloses (Fig. 8c). In the newly formed loop the (A)GGG(A) base-pairs with the (A)CCC(A) present 3 to 44 nucleotides following the 3' splice site. Possibly additional nucleotides participate in the construction of the stem, making it more stable. The high concentration of CCCA (Fig. 6d) may suggest that such pairing takes place even if slippage of the two partners occurs. In such a hairpin structure the 3' splice site is in the loop, exposed to enzymatic cleavage attack. Thus, in both the branching-lariat formation and 3' site splicing and ligation, the nucleotides involved may be present in the loop (Fig. 8). As the relevant figures demonstrate, there is an abundance of C's which could potentially base-pair with the G's. How stable would such a structure be? Since the graphs presented here are composed of numerous primate (and eukaryotic) sequences, I cannot present a detailed secondary structure model with a free energy estimate. Such structures would have to be computed separately for each sequence, and it may be that a consensus structure could be derived. It should be noted, however, that there is no direct evidence that such structures exist.
Fig. 8. A possible model for alternate secondary structures near the 3' splice site. The locations of the putative signals on the linear precursor mRNA are shown in (a). (b) The (A)GGG(A) at 60 nucleotides upstream of the 3' splice site forms a stem with the Y-run just upstream of that 3' site (Y is a pyrimidine; thus either GC or GU base pairs can form here). (c) A branch of the adenine of the YNYTRAY has already formed. The (A)GGG(A) hydrogen-bonds with the (A)CCC(A), 3 to 44 nucleotides downstream of the 3' splice site. The 3' splice site is now in the loop. Further details are given in the Discussion.
Some objections to such a model may arise from the finding that most of the intron can be removed without affecting the splice sites. Analysis of deletion mutants of the rabbit β-globin intervening sequence 2 carried out by Wieringa et al. [9] has demonstrated that only the conserved 5' and 3' splice junction sequences are required for mature mRNA production. It is, however, quite likely that in such cases other sequences present in the preceding exon play the same role normally played by the intron sequences.
While much of the discussion above has been geared to the framework of a particular model, it should be emphasized that the main results, namely, the statistically significant putative signals, are model-independent. So far, most splicing signal searches have focused on the introns, possibly because of the frequent coding constraints on the exons. Figs. 6 and 7 show that putative signals may exist in exons too, and should be taken into account.
This approach also demonstrates the usefulness of general searches for all possible oligomers, in addition to looking for a preconceived consensus. Indeed, recent analysis of the in vitro splicing products of RNA precursors containing tandem duplications of the 5' or 3' splice sites of human β-globin intervening sequence 1 revealed that exon sequences play an important role in the relative use of the duplicated sites [18]. These studies also show that the proximity of the 5' and 3' splice sites is an important determinant in the selection of splice sites. Somasekhar and Mertz [19] have also found that the pattern of splice site selection in alternatively spliced viral pre-mRNAs can be altered by mutations within exons. Thus, exon sequences are clearly implicated in preferred splice site usage.
After this work was completed, a review by Cech came out [20,21]. It is intriguing to observe that the self-splicing Tetrahymena intron resembles in some respects the protein-mediated cis-splicing of nuclear precursor mRNA. Though present in a different location, AGGGAGG plays a role in Tetrahymena splicing, too. Cech suggests, as I have here, that this sequence is base-paired to a pyrimidine run located in the intron (at a different site than in the nuclear precursor mRNA). The attacked residue is in the hairpin loop. Another cycle of intron splicing occurs when synthetic C5 is added. The latter apparently base-pairs with the AGGGAGG, too [20,21]. I have found that in the naturally occurring nuclear precursor mRNA the ACCCA is frequently present in the exon. According to my model (Fig. 8) the attacked nucleotides are not necessarily immediately next to the hairpin stem, as is the case in the self-splicing model of Cech. Clearly there are additional differences as well. Nevertheless, the similarity is interesting and some evolutionary link may exist.
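The base-pairing assumptions behind the model of Fig. 8 can be checked mechanically. The sketch below counts how many positions of two equal-length segments can pair in antiparallel orientation when AU, GC and GU pairs are allowed (T stands for U, as elsewhere in this paper). It is an illustration only: the function names are hypothetical, no free energy is computed, and bulges or slippage are not considered.

```python
# Allowed pairs: Watson-Crick plus the G-U wobble pair, written with T for U.
PAIRS = {("A", "T"), ("T", "A"), ("G", "C"), ("C", "G"), ("G", "T"), ("T", "G")}


def antiparallel_pairs(upstream, downstream):
    """Pair the upstream segment (read 5'->3') with the downstream segment
    read 3'->5', and return one True/False flag per position."""
    if len(upstream) != len(downstream):
        raise ValueError("segments must have equal length")
    return [(a, b) in PAIRS for a, b in zip(upstream, reversed(downstream))]


def count_pairs(upstream, downstream):
    """Number of positions that can pair in the antiparallel alignment."""
    return sum(antiparallel_pairs(upstream, downstream))


print(count_pairs("AGGGA", "TCTCT"))  # AGGGA against a mixed pyrimidine run
print(count_pairs("AGGGA", "ACCCA"))  # AGGGA against the exon ACCCA
```

Under these rules AGGGA pairs at all five positions with a mixed pyrimidine run such as TCTCT (two AU, two GC and one GU pair), whereas against ACCCA only the three central GC pairs form, in line with the description of (A)CCC(A) as partially complementary.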
Acknowledgments
The programs used in this work have been written by J. Owens from the Laboratory of Mathematical Biology, NCI. I would like to thank him for his help and for further modifications he has introduced. I thank Dr. R. Jernigan for bringing to my attention the review by T. Cech in Scientific American.
References
1 Mount, S.M. (1982) Nucleic Acids Res. 10, 459-472
2 Breathnach, R. and Chambon, P. (1981) Annu. Rev. Biochem. 50, 349-384
3 Reed, R. and Maniatis, T. (1985) Cell 41, 95-105
4 Padgett, R.A., Konarska, M.M., Grabowski, P.J., Hardy, S.F. and Sharp, P.A. (1984) Science 225, 898-903
5 Ruskin, B., Krainer, A.R., Maniatis, T. and Green, M.R. (1984) Cell 38, 317-331
6 Zeitlin, S. and Efstratiadis, A. (1984) Cell 39, 589-602
7 Domdey, H., Apostol, B., Lin, R.J., Newman, A., Brody, E. and Abelson, J. (1984) Cell 39, 611-621
8 Rodriguez, J.R., Pikielny, C.W. and Rosbash, M. (1984) Cell 39, 603-610
9 Wieringa, B., Hofer, E. and Weissmann, C. (1984) Cell 37, 915-925
10 Nussinov, R. (1986) Nucleic Acids Res. 14, 3557-3571
11 Nussinov, R., Owens, J. and Maizel, J.V., Jr. (1986) Biochim. Biophys. Acta 866, 109-119
12 Keller, E.B. and Noon, W.A. (1984) Proc. Natl. Acad. Sci. USA 81, 7417-7420
13 Nakata, K., Kanehisa, M. and DeLisi, C. (1985) Nucleic Acids Res. 13, 5327-5340
14 Breathnach, R., Benoist, C., O'Hara, K., Gannon, F. and Chambon, P. (1978) Proc. Natl. Acad. Sci. USA 75, 4853-4857
15 Schmelzer, C. and Schweyen, R.J. (1986) Cell 46, 557-565
16 Sharp, P.A. (1985) Cell 42, 397-400
17 Black, D.L., Chabot, B. and Steitz, J.A. (1985) Cell 42, 737-750
18 Reed, R. and Maniatis, T. (1986) Cell 46, 681-690
19 Somasekhar, M.B. and Mertz, J.E. (1985) Nucleic Acids Res. 13, 5591-5609
20 Cech, T.R. (1986) Sci. Am. 225, 76-84
21 Zaug, A.J., Been, M.D. and Cech, T.R. (1986) Nature 324, 429-433