J. theor. Biol. (1988) 131, 231-234
Matching Nucleotide Sequences of Human Antibodies with other Known Sequences
T. T. WU
Departments of Biochemistry, Molecular Biology and Cell Biology, of Biomedical Engineering, and of Engineering Sciences and Applied Mathematics, Northwestern University, Evanston, IL 60201, U.S.A.

(Received 2 April 1987 and in revised form 29 September 1987)

From an evolutionary point of view, the complementarity-determining regions of antibodies are distinct from other proteins, including the framework regions of antibodies. A search for nucleotide sequences identical to eighty-four segments of 15 consecutive bp from the complementarity-determining regions of human antibody heavy chains among other known sequences yielded four matches: two sequential 15-bp matches, equivalent to one 16-bp match, with the coding region of a sea-urchin testis histone H2b-2; one 15-bp match with the promoter region of a cauliflower mosaic virus inclusion body protein; and one 15-bp match with an intron between exons 1 and 2 of human factor IX. As a control, an identical search with eighty-four segments of 15 consecutive bp from the framework regions of human antibody heavy chains yielded no matches with other sequences except those from other antibody framework regions. Since the currently available nucleotide sequence database used in the search consisted of about $1 \times 10^{7}$ bp, finding such matches in the complementarity-determining regions might not be random.
Introduction

As a class of proteins, antibodies are unique from an evolutionary point of view. Most proteins evolve slowly, on the order of thousands of years, to achieve a specific function in an organism (Fitch & Margoliash, 1970). Antibodies, on the other hand, must evolve or mutate rapidly, on the order of days, so that they can bind the stimulating antigens specifically (Gally & Edelman, 1972). More precisely, the segments of antibody molecules which are responsible for antigen binding (Wu & Kabat, 1970), i.e. the complementarity-determining regions (CDRs), must change rapidly. Other parts of these molecules behave similarly to other proteins.

The molecular bases for antibody production have been extensively studied during recent years (Milstein, 1985). However, the mechanisms involved in the production of CDR nucleotide sequences still require detailed investigation. In the mouse system, D-minigenes play an important role in the generation of the third CDR of heavy chains (Kurosawa & Tonegawa, 1982). In chicken, hyper-conversions among numerous pseudogenes and a functional V-gene are responsible for all three CDRs of lambda light chains (Reynolds et al., 1987). In human, relatively little is known (Wu & Kabat, 1982).

The collection of nucleotide sequences in the GenBank database (GenBank Release, 1987) provides a useful starting point to study the origin of human CDR
0022-5193/88/060231+04 $03.00/0
© 1988 Academic Press Limited
gene segments. On a random basis, a given stretch of 15 consecutive bp would occur once in $4^{15}$ times, or roughly once in a billion. Since the currently available database consists of about $1 \times 10^{7}$ bp, a search with about one hundred, i.e. $10^{9}/10^{7}$, segments of 15 consecutive bp should yield one match on a random basis. Of course, the database is not random, consisting of many sequences from similar genes. The expected number of matches between unrelated sequences should therefore be much less than one, while matches between related sequences could be more numerous.

In our present study, eighty-four segments of 15 consecutive bp were selected from the first and second CDRs of human antibody heavy chain CE-1 (Takahashi et al., 1984) and the third CDR of human antibody heavy chain ND (Seno et al., 1983). The control set consisted of eighty-four segments of 15 consecutive bp from the framework regions of these two human antibody heavy chains. Matches from these two groups of 15-bp segments are reported here.

Materials and Methods
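The random-match estimate given in the Introduction can be checked with a few lines of arithmetic. The sketch below is illustrative only and is not part of the original analysis: it assumes a uniform, independent base composition and a single-strand search, and the names `K`, `DB_SIZE`, `N_QUERIES` and `expected_random_matches` are hypothetical.

```python
# Expected number of chance 15-bp matches under a uniform random-sequence model.
# Assumptions (not from the paper): equal base frequencies, independent
# positions, and only one strand of the database is searched.

K = 15                # segment length in bp (from the text)
DB_SIZE = 10_000_000  # approximate database size in bp (from the text)
N_QUERIES = 84        # number of 15-bp segments searched (from the text)


def expected_random_matches(n_queries: int, k: int, db_size: int) -> float:
    """Expected count of exact k-mer hits in a random database of db_size bp."""
    positions = db_size - k + 1          # possible start positions in the database
    return n_queries * positions / 4**k  # each position matches with probability 4^-k


# Number of distinct 15-mers, roughly one billion.
n_kmers = 4**K

# Number of query segments needed to expect one chance match.
queries_for_one_match = n_kmers / (DB_SIZE - K + 1)

expected = expected_random_matches(N_QUERIES, K, DB_SIZE)

print(f"4^{K} = {n_kmers}")
print(f"queries needed for one expected match: {queries_for_one_match:.1f}")
print(f"expected chance matches for {N_QUERIES} queries: {expected:.2f}")
```

Under these assumptions the expected number of chance matches for eighty-four segments is somewhat below one, in line with the rounded figures quoted in the text.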
The first CDR of human antibody heavy chains usually consists of 15 bp, with occasional longer ones of up to 21 bp. Therefore, we selected the longest one, from CE-1 (Takahashi et al., 1984). This 21-bp region gave seven segments of 15 consecutive bp. The second CDR of CE-1 was also used, giving thirty-four segments of 15 consecutive bp. The longest third CDR of a human antibody heavy chain was that of ND (Seno et al., 1983), with forty-three segments of 15 consecutive bp. A total of eighty-four such segments from three CDRs were used for the present study. A similar set of eighty-four segments of 15 consecutive bp from the framework regions of these two human antibody heavy chains was used as control.

The database used for matching was that stored in GenBank (1987), since it was available on-line on the NIH-supported PROPHET computer system and could be searched relatively easily.

Results
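The segment counts given in Materials and Methods follow from sliding a 15-bp window one base at a time along each CDR. The sketch below illustrates this and a naive exact-match search; it is not the PROPHET search procedure, which the paper does not describe. The function names and the one-entry toy database are hypothetical, the two sequences are copied from Fig. 1, and the CDR lengths of 48 and 57 bp are inferred from the stated segment counts rather than given in the text.

```python
K = 15  # window length in bp (from the text)


def windows(seq: str, k: int = K) -> list[str]:
    """All overlapping k-bp segments of seq, stepping one base at a time."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def find_exact_matches(queries: list[str], database: dict[str, str]):
    """Return (query, entry name, 0-based position) for every exact hit."""
    hits = []
    for query in queries:
        for name, target in database.items():
            pos = target.find(query)
            while pos != -1:
                hits.append((query, name, pos))
                pos = target.find(query, pos + 1)
    return hits


# First CDR of CE-1, 21 bp, as printed in Fig. 1.
cdr1 = "actcgtggaatgtctgtgagc"
segments = windows(cdr1)

# A region of n bp gives n - 14 segments; 48 and 57 bp are inferred lengths.
total_segments = sum(n - K + 1 for n in (21, 48, 57))

# Toy database with the single histone fragment printed in Fig. 1.
toy_db = {"H2b-2 fragment": "gacactggcatctccagccgtggaatgtctgtgatgaacagcttcgt"}
hits = find_exact_matches(segments, toy_db)

print(len(segments), total_segments)
for query, name, pos in hits:
    print(query, name, pos)
```

Two overlapping windows of the first CDR hit the histone fragment at adjacent positions, which is how two sequential 15-bp matches amount to one 16-bp match.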
Out of the eighty-four segments of 15 consecutive bp from human antibody heavy chain CDRs, four matched other sequences in the GenBank database (1987). None of these other sequences was related to antibody genes. Two of the matches were sequential, giving rise to a 16-bp match. This sequence of 16 consecutive bp, cgtggaatgtctgtga, was from the first CDR of CE-1 (Takahashi et al., 1984), coding for amino acid residues 32-35B, and matched a segment in the coding region of a sea-urchin (S. purpuratus) testis histone H2b-2 (Lieber et al., 1986). These two sequences are listed in Fig. 1, together with their

First CDR of CE-1
               31          35
               actcgtggaatgtctgtgagc
                  ||||||||||||||||
S. purpuratus testis H2b-2
gacactggcatctccagccgtggaatgtctgtgatgaacagcttcgt
70             75             80             85
FIG. 1. Sixteen-bp match between the nucleotide sequences of the first CDR of human antibody heavy chain CE-1 and of a sea-urchin (S. purpuratus) testis histone H2b-2. Numbers indicate amino acid positions.
adjacent regions. The same reading frame was used for both, giving the amino acid sequence Arg Gly Met Ser Val, without using the last nucleotide a.

A sequence of 15 consecutive bp from the second CDR of CE-1 (Takahashi et al., 1984), acatctctggagact, coding for amino acid residues 61-65, was also found in a non-coding region of cauliflower mosaic virus (Gardner et al., 1981), 21 nucleotides upstream from the beginning of the coding region for its inclusion body protein. It might be part of the promoter region of that protein.

Another segment of 15 consecutive bp, gagtgattattataa, was from the third CDR of ND (Seno et al., 1983), coding for amino acid residues 99-100D, and was also present in an intron between exons 1 and 2 of human factor IX (Yoshitake et al., 1985), adjacent to an Alu-repeat.

In contrast, the eighty-four segments of 15 consecutive bp from framework regions of human antibodies CE-1 (Takahashi et al., 1984) and ND (Seno et al., 1983) matched no unrelated sequences. As expected, owing to the large collection of antibody sequences in the GenBank database (1987), they matched only sequences from the framework regions of other antibodies.

Discussion
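The 16-bp match and the shared reading frame reported in the Results can be checked directly from the two sequences printed in Fig. 1. The following is a minimal sketch, not the original search procedure. It assumes that both printed fragments begin on a codon boundary, which is consistent with the amino acid numbering in the figure. The helper names are hypothetical, and `CODONS` is a deliberately partial table holding only the five standard-code codons needed here.

```python
def longest_common_substring(a: str, b: str) -> str:
    """Longest contiguous run shared by a and b (dynamic programming)."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1  # extend the run ending at (i-1, j-1)
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return a[best_end - best_len:best_end]


# Sequences as printed in Fig. 1.
cdr1 = "actcgtggaatgtctgtgagc"
h2b = "gacactggcatctccagccgtggaatgtctgtgatgaacagcttcgt"

match = longest_common_substring(cdr1, h2b)

# Offsets of the match; both are multiples of 3, so the match sits in the
# same codon phase in both fragments (given the codon-boundary assumption).
cdr_offset = cdr1.find(match)
h2b_offset = h2b.find(match)
same_frame = cdr_offset % 3 == 0 and h2b_offset % 3 == 0

# Partial codon table (standard genetic code), only the codons used here.
CODONS = {"cgt": "Arg", "gga": "Gly", "atg": "Met", "tct": "Ser", "gtg": "Val"}

# Translate complete codons only; the trailing nucleotide 'a' is left unused.
n_full = len(match) - len(match) % 3
peptide = [CODONS[match[i:i + 3]] for i in range(0, n_full, 3)]

print(match, len(match))
print(cdr_offset, h2b_offset, same_frame)
print(" ".join(peptide))
```

The flanking bases differ on both sides of the shared run, so the match in these fragments is exactly 16 bp long and does not extend further.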
Matches of sequences of 15 consecutive bp in our study clearly distinguish the CDRs from the framework regions of human antibody heavy chains. For CDRs, four matches were found with unrelated sequences, while there were no matches with related sequences. On the other hand, framework region sequences matched only other antibody framework sequences.

The database in GenBank (1987) contained about $1 \times 10^{7}$ bp, with many sequences from related genes. We selected a segment of 15 consecutive bp for our matching study since it should occur on a random basis about once in $4^{15}$ times, or about once in $1 \times 10^{9}$. Therefore, if we search about $10^{9}/10^{7}$, or 100, such sequences, we should find one match in a random collection of $1 \times 10^{7}$ bp. Since the GenBank database (1987) was not random, the chance of finding 15-bp matches between unrelated sequences would be much less than one, probably of the order of 1/10 to 1/100. Matches between related sequences could be more numerous owing to the presence of many similar sequences.

For framework region sequences of human antibody heavy chains, the matches were indeed as expected, i.e. no matches with unrelated sequences and several matches with other antibody framework region sequences. We used eighty-four sequences of 15 consecutive bp, which was close to 100.

In contrast, searching eighty-four sequences of 15 consecutive bp from CDRs of human antibody heavy chains gave a completely unexpected result. Four matches were found with unrelated sequences and none with related sequences. There are several possible explanations.

If we had sequenced the entire human genome of $3 \times 10^{9}$ bp, then on a random basis we should find three matches for any 15 consecutive bp. Thus, the match of 15 bp in the third CDR of human antibody heavy chain ND (Seno et al., 1983) on chromosome 14 and the 15 bp in the intron between exons 1 and 2 of human factor
IX (Yoshitake et al., 1985) on chromosome X may be random. However, the human genome is again not random, containing many related sequences. Matches will be biased towards related sequences. Furthermore, only a very small fraction of the human genome has been sequenced, making such random matches extremely unlikely.

Another possibility is that though we have found a match of 16 consecutive bp from the first CDR of human antibody heavy chain CE-1 (Takahashi et al., 1984) with the coding region of a sea-urchin testis histone H2b-2 (Lieber et al., 1986), humans may have a similar histone gene with similar sequence. The match can be a random one between 16 bp on chromosome 7 and 16 bp on chromosome 14. However, the chance of finding a random 16-bp match is about one in $4 \times 10^{9}$. In the relatively small GenBank database (1987) of about $1 \times 10^{7}$ bp, this is also very unlikely.

Finally, there is the possibility that the human antibody heavy chain CDR sequences are derived from many different gene segments on the human chromosome. As a result, matches can be found between these sequences and completely unrelated sequences. However, the mechanism of incorporating such sequences into CDRs is totally unknown.

To summarize, it is rather unexpected to find two 15-bp segments and one 16-bp segment from human antibody heavy chain CDRs which matched unrelated sequences. The present search of relatively few 15-bp segments in human antibody heavy chain genes, limited primarily by the available nucleotide sequences, serves as an indication that more extensive studies will be required to understand why such matches can be found. Human antibody light chain genes can be analyzed similarly. Hopefully, these results may provide some clue as to the origin of CDRs in humans.

The research was supported by a grant from the Leukemia Research Foundation of Chicago.
Work with the PROPHET computer system was supported by National Institutes of Health Contract NO12-RR-8-2118.
REFERENCES

FITCH, W. M. & MARGOLIASH, E. (1970). Evol. Biol. 4, 67.
GALLY, J. A. & EDELMAN, G. M. (1972). Ann. Rev. Genet. 6, 1.
GARDNER, R. C., HOWARTH, A. J., HAHN, P., BROWN-LUEDI, M., SHEPHERD, R. J. & MESSING, J. (1981). Nucleic Acids Res. 9, 2871.
GENBANK RELEASE 48, 16 February (1987). Los Alamos National Laboratory, Los Alamos, NM and Bolt Beranek and Newman Inc., Cambridge, MA.
KUROSAWA, Y. & TONEGAWA, S. (1982). J. exp. Med. 155, 201.
LIEBER, T., WEISSER, K. & CHILDS, G. (1986). Mol. cell. Biol. 6, 2602.
MILSTEIN, C. (1985). EMBO J. 4, 1083.
REYNOLDS, C.-A., ANQUEZ, V., GRIMAL, H. & WEILL, J.-C. (1987). Cell 48, 379.
SENO, M., KUROKAWA, T., ONO, Y., ONDA, H., SASADA, R., IGARASHI, K., KIKUCHI, M., SUGINO, Y., NISHIDA, Y. & HONJO, T. (1983). Nucleic Acids Res. 11, 719.
TAKAHASHI, N., NOMA, T. & HONJO, T. (1984). Proc. natn. Acad. Sci. U.S.A. 81, 5194.
WU, T. T. & KABAT, E. A. (1970). J. exp. Med. 132, 211.
WU, T. T. & KABAT, E. A. (1982). Proc. natn. Acad. Sci. U.S.A. 79, 5031.
YOSHITAKE, S., SCHACH, B. G., FOSTER, D. C., DAVIE, E. W. & KURACHI, K. (1985). Biochemistry 24, 3736.