J. theor. Biol. (1981) 88, 241-243

Possible Palindromic DNA

JAMES O. GAILIT AND DIXIE W. FREDERIKSEN

Department of Biochemistry, Vanderbilt University, Nashville, Tennessee 37232, U.S.A.

(Received 18 July 1979, and in revised form 27 June 1980)
Using a computer to examine 767 proteins, we discovered 40 proteins, many of them small, which may be coded by substantially palindromic double-stranded DNA. These 40 proteins include the hormone somatostatin and the structural protein collagen.

In this communication we discuss 40 proteins which may be coded by substantially palindromic double-stranded DNA. The two strands of a double-stranded palindrome are identical when read with the same polarity. DNA palindromes have inspired much recent speculation (Karrer & Gall, 1976; Engberg et al., 1976; Subramanian, Reddy & Weissman, 1977; Doyle, 1978). Palindromes appear in promoter and operator regulatory sites of DNA, and the cruciform or hairpin structures of some palindromes might provide a recognition site for proteins binding to DNA (Engberg & Klenow, 1977; Wuilmart, Urbain & Givol, 1977).

We used a computer to search the Protein Sequence Data Tape 76 (National Biomedical Research Foundation, Georgetown University Medical Center, Washington, D.C.) for proteins which may be coded by palindromes. This tape contains the sequence information given by Dayhoff (Dayhoff, 1972, 1973, 1976), a total of 767 proteins and 77 267 residues. From each amino acid sequence we derived a string of sets of possible codons, one set from each residue. The particular amino acid at each point along the protein determines the set of possible codons for that point. We then took the complements of all codons in a string, read the new string of sets of possible codons in the 5' to 3' direction, translated it into a string of sets of possible amino acids, and compared that string with the original protein. In short, we determined how many residues of the original protein can be coded by the complementary DNA strand. We then divided that number by the number of residues in the original protein to calculate percent homology. The average homology for all 767 proteins is 17.6%, with a standard deviation of 9.3.
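The homology calculation described above can be sketched in a few lines of Python. This is only one reading of the procedure, not the authors' program: it assumes the standard genetic code, assumes that each position is tested independently of the others, and uses one-letter amino acid codes for brevity. The names `percent_homology`, `codons_for` and `reverse_complement` are hypothetical.

```python
from itertools import product

# Standard genetic code (assumed), written as RNA; "*" marks a stop codon.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(rna):
    """Complement of a sequence, written with the same 5' to 3' polarity."""
    return rna.translate(COMPLEMENT)[::-1]


def codons_for(residue):
    """Set of possible codons for one amino acid (one-letter code)."""
    return [codon for codon, aa in CODON_TO_AA.items() if aa == residue]


def percent_homology(protein):
    """Percentage of residues that the complementary strand could also code."""
    n = len(protein)
    matches = 0
    for j, residue in enumerate(protein):
        # Read 5' to 3', codon j of the complementary strand lies opposite
        # codon n-1-j of the original strand.
        opposite = protein[n - 1 - j]
        possible = {CODON_TO_AA[reverse_complement(c)] for c in codons_for(opposite)}
        if residue in possible:
            matches += 1
    return 100.0 * matches / n


# Somatostatin, ala-gly-cys-lys-asn-phe-phe-trp-lys-thr-phe-thr-ser-cys
SOMATOSTATIN = "AGCKNFFWKTFTSC"
print(round(percent_homology(SOMATOSTATIN), 1))  # 71.4
```

Under these assumptions the sketch gives 10 matching residues out of 14 for somatostatin. Because positions are tested one at a time, the count is an upper bound on what any single pair of strands could achieve.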
Forty proteins with calculated homology greater than two standard deviations above the average (that is, above 17.6 + 2 × 9.3 = 36.2%) are listed in Table 1.
0022-5193/81/020241+03$02.00/0 © 1981 Academic Press Inc. (London) Ltd.
TABLE 1

Proteins with calculated homology greater than two standard deviations above average

Protein-Source†                                     Number of residues in protein   Per cent homology
Ferredoxin I-Chlorobium limicola                    60                              43.3
Lysozyme-Goose (fragment)                           30                              40.0
Melanostatin II-Bovine                              5                               40.0
Somatostatin-Sheep                                  14                              71.4
Secretin-Pig                                        27                              37.0
Angiotensin I-Chicken                               10                              40.0
Met-lys-bradykinin-Bovine                           11                              36.4
Kinin peptide III-Frog                              14                              42.9
Phyllomedusin-Frog                                  10                              40.0
Xenopsin-Xenopus laevis                             8                               50.0
Viscotoxin B-European mistletoe                     46                              39.1
Phagocytosis-stimulating peptide (Tuftsin)-Human    4                               50.0
Protamine-7 sources‡                                32.4                            42.2
Collagen alpha 1 chain-Bovine skin (fragments)      779                             41.1
Collagen alpha 1 chain-Rat skin (fragments)         671                             41.1
Collagen alpha 1 chain-Chicken skin (fragments)     167                             39.5
Collagen alpha 2 chain-Bovine skin (fragments)      124                             40.3
Lipid-binding protein C-I-Human (ser)               57                              42.1
Fibrinopeptide A-16 sources‡                        16.3                            42.9

† Proteins and their sources are given just as they appear on the Protein Data Tape 76. Some of the sequences are fragmentary.
‡ Proteins of this type are not listed individually; the values given are averages.
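Table 1 has 19 rows but represents 40 proteins, because the protamine and fibrinopeptide A rows each stand for several sources. The short Python tally below checks that count and the two-standard-deviation cutoff against the figures quoted in the text. It is a bookkeeping sketch only; the variable names are hypothetical and the protein labels are abbreviated from the table.

```python
# (label, number of proteins represented, per cent homology), from Table 1
TABLE_1 = [
    ("Ferredoxin I", 1, 43.3), ("Lysozyme", 1, 40.0),
    ("Melanostatin II", 1, 40.0), ("Somatostatin", 1, 71.4),
    ("Secretin", 1, 37.0), ("Angiotensin I", 1, 40.0),
    ("Met-lys-bradykinin", 1, 36.4), ("Kinin peptide III", 1, 42.9),
    ("Phyllomedusin", 1, 40.0), ("Xenopsin", 1, 50.0),
    ("Viscotoxin B", 1, 39.1), ("Tuftsin", 1, 50.0),
    ("Protamine", 7, 42.2),                      # average over 7 sources
    ("Collagen alpha 1, bovine", 1, 41.1), ("Collagen alpha 1, rat", 1, 41.1),
    ("Collagen alpha 1, chicken", 1, 39.5), ("Collagen alpha 2, bovine", 1, 40.3),
    ("Lipid-binding protein C-I", 1, 42.1),
    ("Fibrinopeptide A", 16, 42.9),              # average over 16 sources
]

MEAN, SD = 17.6, 9.3          # per cent homology over all 767 proteins
threshold = MEAN + 2 * SD     # two standard deviations above the average

total_proteins = sum(count for _, count, _ in TABLE_1)
lowest = min(homology for _, _, homology in TABLE_1)

print(total_proteins)                         # proteins represented by the table
print(round(threshold, 1), lowest)            # cutoff and lowest listed value
```

For the two averaged rows the check applies to the average only; whether each individual protamine or fibrinopeptide exceeds the cutoff cannot be read from the table.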
Somatostatin has the highest calculated homology. This hormone contains 14 amino acids in the sequence ala-gly-cys-lys-asn-phe-phe-trp-lys-thr-phe-thr-ser-cys. One possible nucleotide sequence of mRNA for this protein is

GCAGGAUGUAAAAAUUUUUUCUGGAAAACUUUUACAUCCUGC.

Written with the same polarity the complement of this sequence is

GCAGGAUGUAAAAGUUUUCCAGAAAAAAUUUUUACAUCCUGC.
             *    ******    *

Of 42 bases in each of these strands the complement differs from the original by the 8 bases marked with asterisks. The amino acid sequence coded by the complement differs from the original amino acid sequence by four residues. The base homology is 81%; the amino acid homology is 71%. It is quite unlikely that the corresponding sequence of DNA arose by chance alone (Galas, 1978). A possible hairpin configuration for the mRNA, or the DNA, of somatostatin is shown below.
    GCAGGAUGUAAAAAUUUUUU
    ||||||||||||| ||||    C
    CGUCCUACAUUUUCAAAAGGU

The DNA coding other proteins listed in the table may have sufficient symmetry for hairpin or cruciform structures to form. Moreover, since most of the listed proteins are small, the genes coding those proteins are less likely to contain inserted sequences.

The largest proteins listed in Table 1 are collagen chains. Collagen contains about 33% glycine and 25% proline and hydroxyproline. A fair description of collagen is the sequence (gly-x-pro)n, where x represents a variety of amino acids, because the sequence gly-x-pro occurs frequently and repetitively in this structural protein. The mRNA sequence (GGGX1X2X3CCC)n and its complement, where the Xi's are any of the four possible ribonucleotides, can code (gly-x-pro)n in two reading frames of each mRNA strand. Therefore, the corresponding double-stranded DNA can contain the original gene and three additional superimposed genes for the collagen repeat.

REFERENCES

DAYHOFF, M. O. (1972). Atlas of Protein Sequence and Structure, vol. 5. Silver Spring, Maryland: National Biomedical Research Foundation.
DAYHOFF, M. O. (1973). Atlas of Protein Sequence and Structure, Suppl. 1. Silver Spring, Maryland: National Biomedical Research Foundation.
DAYHOFF, M. O. (1976). Atlas of Protein Sequence and Structure, Suppl. 2. Silver Spring, Maryland: National Biomedical Research Foundation.
DOYLE, G. G. (1978). J. theor. Biol. 70, 171.
ENGBERG, J., ANDERSSON, P., LEICK, V. & COLLINS, J. (1976). J. mol. Biol. 104, 455.
ENGBERG, J. & KLENOW, H. (1977). Trends in Biochem. Sci. 2, 183.
GALAS, D. J. (1978). J. theor. Biol. 72, 57.
KARRER, K. M. & GALL, J. G. (1976). J. mol. Biol. 104, 421.
SUBRAMANIAN, K. N., REDDY, V. B. & WEISSMAN, S. M. (1977). Cell 10, 497.
WUILMART, C., URBAIN, J. & GIVOL, D. (1977). Proc. natn. Acad. Sci. U.S.A. 74, 2526.
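The two sequence-level examples in the text, the somatostatin mRNA and the collagen repeat, can be checked with a short Python sketch. It assumes the standard genetic code and treats the repeat (GGGX1X2X3CCC)n as an idealized sequence. The function names `translate`, `collagen_frames` and `is_gly_x_pro` are hypothetical and not taken from the paper.

```python
from itertools import product

# Standard genetic code (assumed), written as RNA; "*" marks a stop codon.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(rna):
    """Complement of a sequence, written with the same 5' to 3' polarity."""
    return rna.translate(COMPLEMENT)[::-1]


def translate(rna, frame=0):
    """Translate every complete codon, starting at offset `frame`."""
    return "".join(CODON_TO_AA[rna[i:i + 3]] for i in range(frame, len(rna) - 2, 3))


# Somatostatin: compare the mRNA quoted in the text with its complement.
mrna = "GCAGGAUGUAAAAAUUUUUUCUGGAAAACUUUUACAUCCUGC"
comp = reverse_complement(mrna)
base_diffs = [i + 1 for i, (a, b) in enumerate(zip(mrna, comp)) if a != b]
aa_diffs = [i + 1 for i, (a, b)
            in enumerate(zip(translate(mrna), translate(comp))) if a != b]
print(comp)
print(len(base_diffs), "of", len(mrna), "bases differ at positions", base_diffs)
print(len(aa_diffs), "of 14 residues differ at positions", aa_diffs)


def collagen_frames(x, repeats=4):
    """Translations of (GGG x CCC)n: two frames on each of the two strands."""
    strand = ("GGG" + x + "CCC") * repeats
    return [translate(s, frame)
            for s in (strand, reverse_complement(strand))
            for frame in (0, 1)]


def is_gly_x_pro(protein):
    """True if residues 1, 4, 7, ... are gly and residues 3, 6, 9, ... are pro."""
    return all(aa == "G" for aa in protein[0::3]) and all(aa == "P" for aa in protein[2::3])


# Example with X1X2X3 = GCU (ala); any of the 64 triplets could be substituted.
for protein in collagen_frames("GCU"):
    print(protein, is_gly_x_pro(protein))
```

The second frame works because GGX1 always codes glycine and the codon CCG, formed across the repeat boundary, codes proline. For some choices of X1X2X3 the middle codon is a stop codon, so the sketch checks only the glycine and proline positions.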