BioSystems 51 (1999) 95–100
www.elsevier.com/locate/biosystems

Specific amino acid content and codon usage account for the existence of overlapping ORFs

Zsolt Boldogkői a,*, Endre Barta b

a Laboratory of Neuromorphology, Semmelweis University of Medicine, Tüzoltó ut 58, Budapest H-1094, Hungary
b Computer group, Agricultural Biotechnology Center, Szent-Györgyi Albert ut 4, Gödöllő H-2100, Hungary

Received 20 November 1998; received in revised form 31 March 1999; accepted 1 April 1999
Abstract

Here we present a novel hypothesis for the origin of overlapping open reading frames (O-ORFs) observed in the ‘non-coding frames’ of several genes of yeast chromosome II. Computer analysis showed that the presence of O-ORFs is related to the specific amino acid content and base distribution pattern at certain genomic locations. This observation prompts us to conclude that these O-ORFs are mere statistical curiosities without any biological function, in contrast to the hypotheses proposed by other authors. © 1999 Elsevier Science Ireland Ltd. All rights reserved.

Keywords: Yeast genome; Overlapping open reading frames; Antisense ORFs; Gene evolution; Genetic code
* Corresponding author. Tel.: +36-1-2156920, ext. 3694; fax: +36-1-2181612. E-mail address: [email protected] (Z. Boldogkői)

0303-2647/99/$ - see front matter © 1999 Elsevier Science Ireland Ltd. All rights reserved. PII: S0303-2647(99)00018-0

1. Introduction

The problem of the coding capacity of the antisense DNA strand has become an important issue in genomics. Several studies have raised the possibility of expression of proteins from the ‘non-coding’ DNA strand (Sharp 1985; Goldstein and Brutlag 1989; Vlcek et al. 1993; Boles and Zimmermann 1994). However, transcription and translation from the antisense strand have been detected in only a limited number of cases (Rak et al. 1982; Henikoff et al. 1986; Adelman et al. 1987; Ward et al. 1996). Two lines of evidence led some authors to assume a coding function for the opposite DNA strand.

The first arose from the observation that specific sense-antisense relationships inherent in the genetic code would make the potentially translated antisense proteins meaningful (Blalock 1990; Zull and Smith 1990). According to Blalock’s Molecular Recognition Theory (1990), codons and their antisense partners specify amino acids with complementary hydropathic patterns, which results in the formation of proteins exhibiting complementary structure. The author suggested that proteins coded by the complementary DNA strands could bind to each other. Zull and Smith (1990) noted that amino acids generated by reading both strands of the genetic code formed three non-communicating groups, i.e. each amino acid belonged to only one of the three groups. They stated that the same groups could be obtained if amino acids were clustered according to their frequency in specific secondary structures (α-helix, β-sheet, or turning position of a protein), and thus the similar nature of sense and corresponding antisense amino acids would allow the putative antisense proteins to retain the secondary structure patterns of the sense proteins. The authors suggested that this sense-antisense exchange pattern and the redundancy of the genetic code could be evolutionarily related.

The second line of evidence supporting the idea of translation from the antisense DNA strand was based on the observation of gene-length antisense (A)-ORFs in various organisms. Several hypotheses have been put forward to interpret these findings. Merino et al. (1994) observed numerous in-phase (frame 1′) A-ORFs in a computer analysis of a large number of sequences derived from various species. They suggested that the predominant use of RNY (N: any base; R: purine; Y: pyrimidine) triplets could account for the existence of A-ORFs, since these triplets do not involve stop codons in either direction. The authors were, however, undecided whether the A-ORFs were created by recent selective forces or were the relics of an ancient genetic translation system. Yomo et al. (1992), analysing the genes encoding the nylon oligomer-degrading enzymes, hypothesised that, although the probability of a long A-ORF arising is very low, once created, special mechanisms protect it from mutations that generate stop codons. According to Ikehara and Okazawa’s (1993) divergent model, the Flavobacterium first created DNA composed of repeating symmetrical triplets, such as (5′-GNC-3′)n. Various bases were then substituted at each codon position in order to decrease the frequency of stop codons, which resulted in the generation of functional genes.
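The RNY argument attributed to Merino et al. (1994) above can be checked by simple enumeration. The following is a minimal sketch, not the procedure used by those authors; the helper name `reverse_complement` is introduced here for illustration only. It lists all 16 RNY triplets and their in-phase antisense partners and confirms that none of them is a stop codon.

```python
from itertools import product

STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string (illustrative helper)."""
    return seq.translate(COMPLEMENT)[::-1]

# All RNY triplets: purine (A/G), any base, pyrimidine (C/T).
rny_triplets = ["".join(t) for t in product("AG", "ACGT", "CT")]

# The in-phase antisense triplet (frame 1') is the reverse complement of the codon.
antisense_triplets = [reverse_complement(t) for t in rny_triplets]

stops_sense = [t for t in rny_triplets if t in STOPS]
stops_antisense = [t for t in antisense_triplets if t in STOPS]

print(len(rny_triplets), stops_sense, stops_antisense)
```

Both lists come out empty because every stop codon starts with T, a pyrimidine, whereas the reverse complement of an RNY triplet is again an RNY triplet and therefore starts with a purine.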
Recently, Cebrat and Dudek (1996) presented a hypothesis to explain the existence of A-ORFs found in yeast chromosome II, sequenced by Feldman et al. (1994). It is based on the observation that the amber and ochre codons contain a two-base palindrome (TA), which is a good substrate for generating stop codons in the related antisense frame. By eliminating these codons from frame 1 (the coding frame) in a computer simulation, they obtained long A-ORFs at frame 3′. The authors suggested that elimination of stop codons could play an important role in the evolution of genes by generating ORFs on both DNA strands. In another article the same authors (Cebrat et al. 1998) analysed the putative coding properties of A-ORFs found in the yeast genome.

All of the above hypotheses assume that A-ORFs are functional. One group of models (Blalock 1990; Zull and Smith 1990; Merino et al. 1994; Ikehara and Okazawa 1993; Cebrat and Dudek 1996; Cebrat et al. 1998) states that the evolution of the genetic code itself was influenced so as to retain the capacity of DNA for double-strand coding. Other models invoke the action of selective forces, either in the generation of ORFs on the opposite strand of existing genes (Ikehara and Okazawa 1993; Cebrat and Dudek 1996) or in keeping open A-ORFs that have arisen spontaneously (Yomo et al. 1992).

In contrast, according to the model proposed by Boldogkői and Murvai (1994) and Boldogkői et al. (1994, 1995), overlapping (O)-ORFs located on both DNA strands were generated as a ‘byproduct’ of a still unknown process resulting in the accumulation of guanine (G) and cytosine (C) bases at the silent (mainly third) positions of codons in GC-rich DNA sequences. The explanation is as follows: since the stop codons (TAG, TGA and TAA) all start with thymine (T), the predominant occurrence of G and C bases at the third codon positions of a gene frequently results in a low occurrence of stop codons at frames 3 and 1′ (see Table 1 for frame relationships).

Here we present a novel hypothesis, which is a generalisation of our previously published model described above. In addition to A-ORFs, this study also includes the analysis of out-of-frame O-ORFs located on the sense strand, since we think that the same mechanisms are responsible for their creation.
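The argument about G and C bases at third codon positions can be illustrated with a toy sequence. The sketch below is an assumption-laden illustration rather than the authors' analysis: `random_gene` builds an artificial frame-1 sequence (it is not a real gene and may itself contain stop codons in frame 1), and `count_stops` counts stop triplets from a given offset. Frame 3 is read from the third base of the sequence, and frame 1′ from the first base of the reverse complement, following the frame relationships of Table 1.

```python
import random

STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string (illustrative helper)."""
    return seq.translate(COMPLEMENT)[::-1]

def count_stops(seq, offset):
    """Count stop triplets when reading seq from a 0-based offset."""
    return sum(seq[i:i + 3] in STOPS for i in range(offset, len(seq) - 2, 3))

def random_gene(n_codons, third_bases, rng):
    """Toy sequence: random first and second bases, third base drawn from third_bases."""
    return "".join(
        rng.choice("ACGT") + rng.choice("ACGT") + rng.choice(third_bases)
        for _ in range(n_codons)
    )

rng = random.Random(0)  # fixed seed, chosen arbitrarily
gc3_gene = random_gene(300, "GC", rng)        # only G or C at codon position 3
control_gene = random_gene(300, "ACGT", rng)  # no constraint at codon position 3

# Frame 3 triplets start at codon position 3 of the gene (offset 2).
frame3_stops = count_stops(gc3_gene, 2)
# Frame 1' triplets start opposite codon position 3 (offset 0 on the reverse complement).
frame1p_stops = count_stops(reverse_complement(gc3_gene), 0)

print("GC3 gene:", frame3_stops, frame1p_stops)
print("control :", count_stops(control_gene, 2),
      count_stops(reverse_complement(control_gene), 0))
```

For the GC3 sequence both counts are zero by construction: every triplet in frames 3 and 1′ begins with G or C, so it cannot be TAA, TAG or TGA. Real genes only approach this limit, which is why the effect described in the text is a reduction in stop codon frequency rather than their complete absence.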
2. Methods

Yeast chromosome II sequences and the list of ORFs were taken from the anonymous FTP server of the Martinsried Institute for Protein Sequences (MIPS). DNA sequences were extracted using standard GCG programs and edited to remove introns. All of the existing ORFs (410) were used for further analysis. The frequencies of mono-, di- and trinucleotides in all frames were calculated using PERL scripts written by the authors. Differences in ORF length were taken into account by using weighted averages to calculate the overall base distribution frequencies. The frequencies at the overlapping region were calculated separately for each type of O-ORF. When two ORFs overlapped and it was doubtful which was the gene, we considered the longer one the gene and the shorter one the O-ORF. We found this distinction reasonable, since the codon usage of the longer ORFs was in all cases much more similar to the average than that of the shorter ones (data not shown). The data of O-ORFs were disregarded in the calculation of overall base distribution frequencies; however, this did not significantly influence the values obtained.
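The kind of calculation described above can be sketched as follows. This is not the authors' PERL code, which is not given in the paper; the function names, the pattern syntax ('N' for any base) and the two toy sequences are all assumptions made for illustration, and the weighting by sequence length is one plausible reading of ‘weighted averages’.

```python
from collections import Counter

def triplet_counts(seq, offset=0):
    """Count complete triplets read from a 0-based offset (0 = frame 1)."""
    return Counter(seq[i:i + 3] for i in range(offset, len(seq) - 2, 3))

def pattern_frequency(seq, pattern, offset=0):
    """Percentage of triplets matching a pattern such as 'TAN' (N = any base)."""
    counts = triplet_counts(seq, offset)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    hits = sum(
        n for triplet, n in counts.items()
        if all(p == "N" or p == b for p, b in zip(pattern, triplet))
    )
    return 100.0 * hits / total

def weighted_average(genes, pattern):
    """Length-weighted mean of the per-gene pattern frequencies in frame 1."""
    total_length = sum(len(g) for g in genes)
    return sum(pattern_frequency(g, pattern) * len(g) for g in genes) / total_length

# Hypothetical toy ORFs, not taken from yeast chromosome II.
genes = ["ATGTACTAA", "ATGGCCTATGCCTGA"]
print(weighted_average(genes, "TAN"))
print(weighted_average(genes, "NTA"))
```

Weighting each gene by its length makes the result identical to pooling all triplets before counting, so long genes contribute in proportion to the number of codons they contain. Dinucleotides that span two codons (NNT–ANN, NNA–TNN) would need a separate counter and are not covered by this sketch.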
3. Results and discussion

The hypothesis discussed here is based on the observation that specific amino acid content and codon usage can generate O-ORFs by decreasing the probability of stop codons arising and/or by increasing the probability of start codons arising. Like Cebrat and Dudek (1996), we analysed yeast chromosome II for O-ORFs longer than 300 base pairs (bp). The result we obtained differed from that of these authors. Analysing all of the 410 genes, we found 27 O-ORFs: 18 at frame 3′, 4 at frame 1′, 1 at frame 2′, 3 at frame 2 and 1 at frame 3 (see Table 1 for frame relationships). Specifically, our analysis showed the following (see Table 2).

Table 1
Phase relationships among different reading frames a

Sense frames        Antisense frames
1    2    3         1′   2′   3′
1                   3
2    1              2    3
3    2    1         1    2    3
1    3    2         3    1    2
2    1    3         2    3    1
3    2    1         1    2    3
1    3    2         3    1    2
2    1    3         2    3    1
3    2    1         1    2    3

a The numbers indicate the position of a base in a triplet. Since amino acids are determined by triplets, the number of phases is limited to three on both the sense and the antisense DNA strand. In a hypothetical 4th phase the order of amino acids would be identical to that of the 1st, except that the first amino acid would be missing. The order of bases in the sense DNA strand is the same as in mRNA, except that U is used instead of T in the mRNA.

3.1. Frame 3′

We found that the frequency of TA dinucleotides at phase 1 (TAN) was lower than at the other two phases (NTA and NNT–ANN), and that it was significantly lower still at the overlapping region of genes. We explain this by the low frequency of TAY triplets coding for tyrosine (Tyr) at these locations (TAR specifies stop codons). We also examined the distribution of AT dinucleotides and found a high overall ATN content, which was even higher at the overlapping region. TA and AT dinucleotides are two-base palindromes which form the first two bases of the stop and start codons, respectively. Thus, low TAN and high ATN content at any region of a gene enhances the probability of generating O-ORFs at frame 3′.

3.2. Frame 1′

The four genes containing 1′-frame O-ORFs have low NTA content, especially at the overlapping region, which is explained by the high G+C content in codon position 3 of these genes. This kind of codon usage reduces the chance of stop codons arising at this frame on the complementary strand; the same explanation holds as in the case of GC-rich organisms (see Introduction). In addition, these genes have a very high content of CAT (coding for histidine). Because ATG is the reverse complement of CAT, each CAT codon produces an ATG at frame 1′. High initiation and low termination codon frequency allows the appearance of relatively lengthy O-ORFs at this frame.
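The correspondence between a dinucleotide phase in the gene and the frame in which it opens a triplet can be derived mechanically. The sketch below is illustrative only: the function names are hypothetical, the background sequence of G bases is an arbitrary choice, and the antisense frame labels follow the convention of Table 1 (reading the reverse complement from offsets 0, 1 and 2 gives frames 1′, 3′ and 2′, respectively, for a sequence whose length is a multiple of three).

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string (illustrative helper)."""
    return seq.translate(COMPLEMENT)[::-1]

def triplets(seq, offset):
    """Complete triplets read from a 0-based offset."""
    return [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]

def six_frames(gene):
    """Map frame labels to triplet lists; len(gene) must be a multiple of 3."""
    rc = reverse_complement(gene)
    return {
        "1": triplets(gene, 0), "2": triplets(gene, 1), "3": triplets(gene, 2),
        "1'": triplets(rc, 0), "3'": triplets(rc, 1), "2'": triplets(rc, 2),
    }

def frames_starting_with(dinuc, phase, n_codons=4):
    """Frames in which a palindromic dinucleotide placed at a sense phase opens a triplet."""
    bases = ["G"] * (3 * n_codons)
    start = 3 + (phase - 1)           # put the dinucleotide into the second codon
    bases[start:start + 2] = dinuc
    gene = "".join(bases)
    return sorted(
        frame for frame, ts in six_frames(gene).items()
        if any(t.startswith(dinuc) for t in ts)
    )

for phase, label in [(1, "TAN"), (2, "NTA"), (3, "NNT-ANN")]:
    print(label, frames_starting_with("TA", phase))

# A CAT codon in the gene reads as ATG on the complementary strand.
print(reverse_complement("CAT"))
```

Under these assumptions the pairing agrees with the frames given in parentheses in Table 2: TA at phase 1 opens triplets in frames 1 and 3′, at phase 2 in frames 2 and 1′, and at the codon junction in frames 3 and 2′. The same holds for the AT palindrome, which is why a high ATN content supplies potential ATG start codons at frame 3′.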
Table 2
The average frequency (%) of triplets of all examined genes (frame 1) found on yeast chromosome II and the average frequencies of triplets at the appropriate frames containing O-ORFs a

Triplet or base class        Overall    At the overlapping region
TAN                          3.5        1.0 (3′)
NTA                          7.0        1.6 (2); 2.1 (1′)
NNT–ANN                      9.2        3.9 (3); 0.0 (2′)
ATN                          8.7        10.3 (3′)
NAT                          10.8       –
NNA–TNN                      6.4        –
CAT                          1.4        7.0 (1′)
A+G at codon position 1      61.6       –
A+G at codon position 2      49.2       64.1 (3)
A+G at codon position 3      47.7       –
G+C at codon position 1      44.3       62.2 (1′); 73.5 (2′)
G+C at codon position 2      36.7       –
G+C at codon position 3      37.9       60.2 (3)

a For the sake of simplicity, we used T letter in triplets independently whether it was the component of DNA or RNA strand. To be correct, we should have used U letters when speaking about codons (triplets on mRNA). The number in parenthesis is the frame of the overlapping region for which the frequency is calculated.
3.3. Frame 2′

G and C accumulation in codon position 1 results in a low occurrence of TA dinucleotides at phase 3 (NNT–ANN), which accounts for the low frequency of stop codons at frame 2′, since two of the three termination codons have the TAR configuration.
3.4. Frame 2

The three genes possessing O-ORFs in frame 2 contain a low number of NTA triplets at the overlapping region, which favours the appearance of O-ORFs at this frame. The same reasoning applies as above: a TA dinucleotide at codon positions 2 and 3 of a gene forms the first two bases of a triplet at frame 2.
3.5. Frame 3

G+C accumulation in codon position 3 results in a relatively low NNT–ANN content, which in turn results in a low frequency of stop codons with the TAR configuration at this frame. In addition, the A+G content in codon position 2 of this gene is significantly lower than the average, which further decreases the frequency of stop codons. The explanation is that codon position 2 of a gene corresponds to the third position of a triplet located at frame 3. Stop codons contain A or G as their third base; thus, low A+G content at codon position 2 of a gene (frame 1) results in a low probability of stop codons appearing at frame 3.

In conclusion, the clustering of O-ORFs at frame 3′ is explained by the low frequency of triplets coding for Tyr, as well as by the high ATN content at the overlapping region. Although the low number of O-ORFs at the other frames does not allow precise statistical analysis, their existence can also be explained by specific base distribution patterns, including G and C accumulation at codon positions 1 or 3, low TA and high AT content at other phases, etc.

We do not think that genes were created (simultaneously on both DNA strands) by elimination of stop codons. Spontaneously arising stop codons are indeed selected against, which conserves protein function. However, this process differs from what Cebrat and Dudek simulated using a computer. Whereas they eliminated stop codons from the whole gene pool, negative selection removes them before they can spread in the population. That is, a distinction has to be made here between the terms elimination and non-existence. Elimination of stop codons from the gene pool, if it occurred, would indeed reduce the number of TA dinucleotides at phase 1. Non-existence, however, only means that the triplets TAG and TAA are not used at frame 1 of genes. Merino and his colleagues (1994), examining the DNA of various organisms, found O-ORFs almost exclusively at frame 1′, which indicates that the lack of stop codons is not an important factor in producing O-ORFs.

The implication of our results for the problem of coding in overlapping frames is as follows. O-ORFs without any function can appear as a result of the specific amino acid content and codon usage of a gene. Thus, our hypothesis can be considered a ‘neutral way’ for O-ORFs to arise. Although it cannot be excluded that these O-ORFs are utilised thereafter, we do not think that the mere presence of functionless ORFs would be critical for the emergence of new genes.
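The ‘neutral’ view can be given a rough quantitative feel with a back-of-the-envelope calculation. The sketch below is not part of the original analysis. It assumes that stop triplets occur independently with a fixed per-triplet probability, which real sequences do not strictly satisfy, and the function name and the two example probabilities are chosen purely for illustration.

```python
def stop_free_probability(p_stop, n_triplets):
    """Probability that n consecutive triplets contain no stop codon,
    assuming each triplet is a stop independently with probability p_stop."""
    return (1.0 - p_stop) ** n_triplets

N_TRIPLETS = 100  # 300 bp, the length threshold used for O-ORFs in this study

# Assumed example values: 3 of 64 triplets are stops in a uniform random
# sequence; 0.01 stands for a hypothetical region depleted of stop triplets.
for label, p in [("uniform random", 3 / 64), ("stop-depleted", 0.01)]:
    print(label, round(stop_free_probability(p, N_TRIPLETS), 4))
```

Under these assumptions a 300 bp window is stop-free in somewhat under 1% of cases for a uniform random sequence, but in roughly 37% of cases once the per-triplet stop probability falls to 1%. A local reduction in stop-forming triplets of the kind reported in Table 2 is therefore enough to make long open frames unsurprising, without invoking selection for them.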
References

Adelman, J.P., Bond, C.T., Douglass, J., Herbert, E., 1987. Two mammalian genes transcribed from the opposite strands of the same DNA locus. Science 235, 1514–1517.
Blalock, J.E., 1990. Complementarity of peptides specified by ‘sense’ and ‘antisense’ strands of DNA. Trends Biotech. 8, 140–144.
Boldogkői, Z., Murvai, J., 1994. A novel explanation for the existence of open reading frames on latency-associated transcripts of α-herpesviruses. Virus Genes 9, 47–50.
Boldogkői, Z., Kaliman, A., Murvai, J., Fodor, I., 1994. Sense antisense DNA strand? Acta Vet. Hung. 42, 243–249.
Boldogkői, Z., Murvai, J., Fodor, I., 1995. G and C accumulation at silent positions of codons produces additional ORFs. Trends Genet. 11, 125–126.
Boles, E., Zimmermann, F.K., 1994. Open reading frames in the antisense DNA strands of genes coding for glycolytic enzymes in Saccharomyces cerevisiae. Mol. Gen. Genet. 243, 363–368.
Cebrat, S., Dudek, M.R., 1996. Generation of overlapping open reading frames. Trends Genet. 12, 12.
Cebrat, S., Mackiewicz, P., Dudek, M.R., 1998. The role of genetic code in generating new coding sequences inside existing genes. BioSystems 45, 165–176.
Feldman, H., et al., 1994. Complete DNA sequence of yeast chromosome II. EMBO J. 13, 5795–5809.
Goldstein, A., Brutlag, D.L., 1989. Proc. Natl. Acad. Sci. USA 86, 42–45.
Henikoff, S., Keene, M.A., Fechtel, K., Fristrom, J.W., 1986. Gene within a gene: nested Drosophila genes encode unrelated proteins on opposite DNA strands. Cell 44, 33–42.
Ikehara, K., Okazawa, E., 1993. Unusually biased nucleotide sequences on sense strands of Flavobacterium sp. genes produce nonstop frames on the corresponding antisense strands. Nucleic Acids Res. 21, 2193–2199.
Merino, E., Balbás, P., Puente, J.L., Bolivar, F., 1994. Antisense overlapping open reading frames in genes from bacteria to humans. Nucleic Acids Res. 22, 1903–1908.
Rak, B., Lusky, M., Hable, M., et al., 1982. Expression of two proteins from overlapping and oppositely oriented genes on transposable DNA insertion element IS5. Nature 297, 124–128.
Sharp, P.M., 1985. Does the ‘non-coding’ strand code? Nucleic Acids Res. 13, 1389–1397.
Vlcek, C., Kozmik, Z., Paces, V., Schirm, S., Schwyzer, M., 1993. Pseudorabies virus immediate-early gene overlaps with an oppositely oriented open reading frame-characterization of their promoter and enhancer regions. Virology 179, 365–377.
Ward, P.L., Barker, D.E., Roizman, B., 1996. A novel herpes simplex virus 1 gene, UL43.5, maps antisense to the UL43 gene and encodes a protein which colocalizes in nuclear structures with capsid proteins. J. Virol. 70, 2684–2690.
Yomo, T., Urabe, I., Okada, H., 1992. No stop codons in the antisense strands of the genes for nylon oligomer degradation. Proc. Natl. Acad. Sci. USA 89, 3740–3784.
Zull, J.E., Smith, S.K., 1990. Is genetic code redundancy related to retention of structural information in both DNA strands? Trends Biochem. Sci. 15, 257–261.